NOTICE 


THIS DOCUMENT HAS BEEN REPRODUCED FROM 
MICROFICHE. ALTHOUGH IT IS RECOGNIZED THAT 
CERTAIN PORTIONS ARE ILLEGIBLE, IT IS BEING RELEASED 
IN THE INTEREST OF MAKING AVAILABLE AS MUCH 
INFORMATION AS POSSIBLE 



JPL PUBLICATION 80-73 


(NASA-CR-163986) FAULT-TOLERANT COMPUTER N81-18675
STUDY Final Report (Jet Propulsion Lab.)
238 p HC A11/MF A01 CSCL 09B
Unclas
G3/60 41574


Fault-Tolerant Computer Study 

Final Report 

David A. Rennels 
Algirdas A. Avizienis 
Milos D. Ercegovac 


February 1, 1981 



National Aeronautics and 
Space Administration 

Jet Propulsion Laboratory 

California Institute of Technology 
Pasadena, California 




The research described in this publication was carried out 
by the Jet Propulsion Laboratory, California Institute of 
Technology, and was sponsored by the Naval Ocean Systems 
Center, San Diego, California, through an agreement with 
NASA. 


ABSTRACT 


This report describes a set of building-block circuits 
which can be used with commercially available microprocessors and 
memories to implement fault-tolerant distributed computer systems. 

Each building-block circuit is intended for VLSI implementation as
a single chip. Several building blocks and associated processor and
memory chips form a self-checking computer module with self-contained
input/output and interfaces to redundant communications buses. Fault
tolerance is achieved by connecting self-checking computer modules into
a redundant network in which backup buses and computer modules are
provided to circumvent failures.

Included in the report is a discussion of the requirements
and design methodology which led to the definition of the building-block
circuits. This is followed by a set of logic designs for three
of the building blocks. These are designs which are being used to
construct a laboratory breadboard of a self-checking computer module.
The logic designs will be modified and improved as the breadboard is
debugged and tested. Further refined designs will become available
when the breadboard is completed and tested and again, hopefully, when
the VLSI devices are fabricated.




ACKNOWLEDGMENT 


This study was initiated by the Naval Ocean Systems Center,
Code 923, and represents a facet of a block-funded program entitled
Integrated Circuit Technology, sponsored by the Naval Electronics
Systems Command, Technology Division. The work was performed by agreement
with NASA under Contract NAS7-100 at the Jet Propulsion Laboratory
of the California Institute of Technology. This program is continuing
under NASA sponsorship, and related system studies are being conducted
at the University of California, Los Angeles under sponsorship of the
Office of Naval Research.

A special acknowledgment is due to Reeve Peterson of NOSC 
for his continued support and encouragement of this effort. We are 
also indebted to Dick Urban and Ed Holland of NOSC for their guidance
and support. 

For the continuing effort, which involves the detailed 
design and implementation of an engineering model of this work, an 
acknowledgment is owed to Lee Holcomb of the NASA Office of Aeronautics
and Space Technology for his support. 

An additional acknowledgment is due Jim Bryden of JPL, whose
help was invaluable in carrying out this study and bringing this final
report to parturition.




CONTENTS

1 SUMMARY AND OVERVIEW
1.1 SYSTEM REQUIREMENTS
1.2 BUILDING-BLOCK COMPUTER REQUIREMENTS
1.3 DESIGN APPROACH
1.4 THE BUILDING-BLOCK CIRCUITS
1.4.1 The Memory-Interface Building Block (MIBB)
1.4.2 The Core Building Block (Core-BB)
1.4.3 The Bus-Interface Building Block (BIBB)
1.4.4 I/O Building Block (IOBB)
1.5 SCCM PROPERTIES
1.6 THE DISTRIBUTED COMPUTER (SCCM) ARCHITECTURE
1.7 SUMMARY
1.8 REPORT OUTLINE

2 THE CONCEPTS OF FAULT-TOLERANT COMPUTING
2.1 APPROACHES TO THE FAULT PROBLEM
2.1.1 Tolerance and Avoidance: Complementary Approaches to the Fault Problem
2.1.2 Classes of Physical Faults
2.2 TOLERANCE OF PHYSICAL FAULTS
2.2.1 Fault Masking
2.2.2 Fault Detection
2.2.3 Recovery
2.3 FAULT-TOLERANT SYSTEMS
2.3.1 Hardware-Controlled Recovery Systems
2.3.2 Software-Controlled Recovery Systems
2.3.3 Fault-Tolerant Subsystems
2.4 MODELING AND ANALYSIS
2.4.1 Analytic Modeling: Permanent Faults
2.4.2 Analytic Modeling: Transient Faults
2.4.3 Heuristic Approaches: Simulation and Experiments
2.5 TOLERANCE OF MAN-MADE FAULTS
2.5.1 Design Faults
2.5.2 Interaction Faults
2.6 CURRENT PROBLEMS AND PROSPECTS FOR THE FUTURE
2.6.1 Reasons for Fault-Tolerance
2.6.2 A Design Methodology
2.6.3 Current Roadblocks
2.6.4 Goals and Prospects

3 OBJECTIVES AND ARCHITECTURE SELECTION
3.1 REQUIREMENTS FOR FAULT-TOLERANT BUILDING-BLOCK COMPUTERS (FTBBC)
3.2 DISTRIBUTED COMPUTERS
3.3 THE DISTRIBUTED COMPUTER MODEL
3.3.1 The Intercommunication Bus Structure
3.4 FAULT-TOLERANCE OPTIONS
3.4.1 The Terminal Modules
3.4.2 The High-Level Modules
3.4.3 The Intercommunication Bus System Requirements
3.4.4 Architecture Selection
3.5 BUILDING-BLOCK DEFINITION
3.5.1 The Self-Checking Computer Module (SCCM)
3.5.2 The Memory Interface Building Block (MIBB)
3.5.3 The Core Building Block (Core-BB)
3.5.4 The Bus Interface Building Block (BIBB)

4 BUILDING-BLOCK DESCRIPTIONS
4.1 THE MEMORY INTERFACE BUILDING BLOCK
4.1.1 Memory Interface Building-Block Requirements
4.1.2 Memory Interface Building-Block Design
4.1.3 Error Control Capabilities
4.1.4 Design of Memory Interface Building Block
4.1.5 Estimated Complexity of Implementation
4.2 THE CORE BUILDING BLOCK
4.2.1 Core Building Block Requirements
4.2.2 Core Building Block Implementation
4.3 THE BUS INTERFACE BUILDING BLOCK (BIBB)
4.3.1 Bus System Requirements
4.3.2 Bus Controller Functions
4.3.3 Bus Adaptor Functions
4.3.4 BIBB Implementation
4.3.5 BIBB Microprograms

BIBLIOGRAPHY

APPENDIX


Figures

1-1 The Self-Checking Computer Module (SCCM)
1-2 Reliability Improvement Using SCCMs
1-3 Distributed Standby Redundant Architecture
2-1 System Reliability Predictions
2-2 Markov Reliability Model for Closed Systems
2-3 Transient Fault Recovery Process
2-4 Transient Recovery in the Markov Model
2-5 Equivalent Form of the Markov Model
3-1 A Non-Dedicated Distributed Computer Architecture
3-2 The Distributed Processing Architecture
3-3 The Self-Checking High-Level Module
3-4 MIL-STD 1553A Formats
3-5 Bus Interface Building Block Connections
4-1 MIBB Subsystems
4-2 General Flow Diagram
4-3 Address Bus Interface
4-4 Soft Name Checker (SNC)
4-5 Address Parity Checker (APC)
4-6 5-Input Morphic Comparator (MPC5)
4-7 Data Bus-Storage Array Interface
4-8 Data/Check Bit Module (DBM, CBM)
4-9 Memory Data Register - Data Bit Module (2X)
4-10 Memory Data Register - Check Bit Module
4-11 Bit Interface Module (3X)
4-12 Bit-Plane Interface Module (2X)
4-13 Spare Plane Interface Module (3X for SP[illegible], 3X for SP[illegible])
4-14 Replacement Control Section (RCS)
4-15 Error Control Section (ECS)
4-16 Data Parity Checker-Generator (DPCG)


4-17 Syndrome Generators/Checkers (SGC) 6X
4-18 SEC/DED Analyzer (SDA)
4-19 Error Status Register and Memory Interrupt (ESR/MI)
4-20a to 4-20d Control Interface and Clock Generator (CI&CG); MCS State Diagram; State Sequencer (SS) [one of these four titles is illegible]
4-21 Core-BB Block Diagram
4-22 The Processor Check Element
4-23 Processor Check Element Logic: (a) Parity Check/Generate; (b) Morphic Processor Comparison; (c) Isolator; (d) Command Decoder; (e) Status Registers
4-24 Bus Arbiter Layout
4-25 Priority Resolver Logic
4-26 Morphic AND Circuits: (a) Self-Checking Exclusive-OR Reduction Circuit; (b) Reduction Trees
4-27 Core Building Block - Interconnection Diagram
4-28 Fault Synchronizer and Recovery Sequencer
4-29 Manual and External Module Control
4-30 Simplified BIBB Block Diagram
4-31 External Bus Manager Block Diagram
4-32 Manchester to NRZ Translator
4-33 External Bus Interface, BAC - Control
4-34 External Bus Interface, BAC - Data Paths
4-35 External Bus Interface, BAC - Fault Detection Logic
4-36 The Internal Bus Interface
4-37 The IBI - DMA Controller
4-38 IBI - Fault Handling Circuits


4-39(a) The [name illegible]
4-39(b) The [name illegible] - Fault Latches for Status Sample
4-40 The Controller - CROM and CS
4-41 Bus Control Sequencer
4-42 Fault Handler


Tables

2-1 Characterization of Several Models of Fault-Tolerant [remainder illegible]
2-2 Algorithm for the Components of Matrix A
2-3 Derivation of Transient Reliability Measures
4-1 Odd-Weighted SEC/DED Code
4-2 Component Count
4-3 Conditions for Examining Morphic Check Signals
4-4 Memory Mapped BC Commands
4-5 Bus Control Table Formats
4-6 IBI Transfer Commands
4-7 DMA Command Codes (DMAC)
4-8 Control Sequencer Inputs
4-9 A Control Sequencing Example
4-10 Bus Adaptor Microprogram
4-11 Bus Controller Microprogram




SECTION 1

SUMMARY AND OVERVIEW


Over the last decade, the methodology of fault-tolerant
computing has been developed to increase the reliability of computer
systems. Fault-tolerant computers have been designed to contain redundant
circuits and, when hardware faults occur, they utilize the redundant
circuits to continue correct computation. By and large, these
have all been custom-designed computer systems [AVIZ 77].

This study was undertaken as part of the NOSC Very-Large-Scale-Integrated-Circuit
Technology Program to define VLSI building-block
circuits which can be used with commercially available microprocessors
and memories to implement fault-tolerant computer systems.
This approach is taken with the view that a wide range of government
requirements can be satisfied with commercially developed processors.
Thus, the direction of this study is to define the supporting circuits
necessary to utilize existing processors in fault-tolerant configurations.

The principal result is a determination that a small number
of building-block circuits can be developed which will allow construction
of both centralized and distributed (multi-computer) computer configurations
which are fault tolerant. These building blocks consist of
(1) an Error Detecting and Correcting Memory Interface Circuit, (2) a
CORE Processor Checker and Fault-Handling Circuit, (3) a Self-Checking
Programmable Bus-Interface Circuit, and (4) several I/O circuits to
perform voting, error checking, and short isolation. The design of the
first three building blocks for a feasibility breadboard is described
in this report, along with the rationale behind their selection.

1.1 SYSTEM REQUIREMENTS

Reliability is a continuing problem in complex military
systems. The cost of unexpected failures shows up in many ways, including
reduced operational readiness and the large number of personnel
involved in maintenance. Dollar costs are usually difficult to quantify
because system procurement and costs of ownership are usually parcelled
into various areas of responsibility. It can be said, however, that
costs of ownership often exceed procurement costs in a large number of
major systems.

By increasing testability, maintainability and, in some
cases, providing automated redundancy management in the early stages of
a system design, it is expected that life-cycle costs can be reduced.
This viewpoint advocates moderately increasing initial hardware costs to
achieve improved reliability and reduced maintenance during a system's
operational lifetime.

The computers within a system provide the starting point for
automated maintenance. If computer reliability is assured, the computers
can be used for (1) subsystem testing and failure diagnosis,
(2) automatically replacing failed subsystems with spare parts, or
(3) where no backup spares are available, modifying on-board processing
to account for the degraded subsystem state. Stated another way, the
computer becomes an automated repairman.

A second area of requirements for fault-tolerant computing
occurs when the cost of computer failure becomes clearly unacceptable.
Digital flight control of low-flying aircraft is a dramatic example.
Although the number of applications of this type is relatively low,
they may be expected to increase as the computer is relied upon more
heavily.

1.2 BUILDING-BLOCK COMPUTER REQUIREMENTS 

The user of a fault-tolerant building-block computer (FTBBC)
system should be allowed to specify a maintenance interval and the
reliability required over that interval. This has two major implications.
First, the FTBBC configurations must allow the modular addition
of redundant elements so that the same design, with differing numbers of
spares, can economically satisfy both short- and long-life requirements.
Secondly, the fault detection and recovery mechanisms of the FTBBC must
be nearly perfect. Previous modeling studies have shown that "coverage"
(the conditional probability that the system can implement recovery,
given that a fault occurs) must approach 100% for long-term reliability,
whether or not a fault-tolerant system is periodically maintained
[BOUR 69].
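
The sensitivity to coverage can be illustrated with a small numerical
sketch. It is not taken from the report: it assumes one active module
with a constant failure rate, a single standby spare that cannot fail
while dormant, and a coverage c. Under those assumptions the standard
result is R(t) = exp(-lam*t) * (1 + c*lam*t). The function name and the
numbers in the demonstration loop are illustrative only.

```python
import math


def standby_reliability(lam: float, t: float, coverage: float) -> float:
    """Reliability at time t of one active module plus one dormant spare.

    lam      -- constant failure rate of the active module (per hour)
    coverage -- conditional probability that recovery succeeds, given a fault
    Assumes the spare does not fail while dormant (an idealization).
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie in [0, 1]")
    x = lam * t
    # exp(-x): no failure at all; coverage * x * exp(-x): exactly one
    # failure that was successfully covered by switching to the spare.
    return math.exp(-x) * (1.0 + coverage * x)


if __name__ == "__main__":
    lam, t = 1e-4, 5000.0  # hypothetical failure rate and mission time
    for c in (0.0, 0.9, 0.99, 1.0):
        print(f"coverage={c:4.2f}  R={standby_reliability(lam, t, c):.4f}")
```

With zero coverage the spare contributes nothing and the result reduces
to the reliability of a single module; the benefit of the spare grows in
direct proportion to the coverage.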

In order to be effective, a fault-tolerant computer must be
designed to recover from a comprehensive set of faults, i.e., all the
faults that can be reasonably expected to occur. We have attempted to
protect against stuck-at faults on a single chip, most massive failures
in a single chip or module, and most transient faults which create
errors but which are of short duration. We do not expect unrelated
hard faults to occur in different modules simultaneously.

The FTBBC architecture must be amenable to easy maintenance.
Plug-in replacement modules should require a minimum of contact pins and
should not require connectors at high-bandwidth, noise-sensitive points
in the computer. Similarly, the computer should be capable of identifying,
during routine maintenance, those modules which must be replaced.

The architecture of the building blocks should be capable of
supporting a wide variety of processor and memory chips, i.e., the
building block designs should not depend upon the peculiar I/O characteristics
of any given processor. By initiating all control and I/O
functions with out-of-range memory addresses (memory-mapped I/O), this
processor independence can be achieved.
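
The memory-mapped approach described above can be sketched as a simple
address decoder. This is only an illustration: the address boundary
MEMORY_TOP, the class name LocalBus, and the handler interface are
hypothetical and are not taken from the report.

```python
from typing import Callable, Dict, List, Tuple

MEMORY_TOP = 0x7FFF  # hypothetical: highest address backed by real memory


class LocalBus:
    """Routes writes either to memory or to a building-block command window."""

    def __init__(self) -> None:
        self.ram: Dict[int, int] = {}
        self.devices: List[Tuple[range, Callable[[int, int], None]]] = []

    def map_device(self, first: int, last: int,
                   handler: Callable[[int, int], None]) -> None:
        """Attach a handler to an address window above real memory."""
        if first <= MEMORY_TOP:
            raise ValueError("device window overlaps memory")
        self.devices.append((range(first, last + 1), handler))

    def write(self, address: int, value: int) -> str:
        """Perform an ordinary store; report where it was routed."""
        if address <= MEMORY_TOP:
            self.ram[address] = value
            return "memory"
        for window, handler in self.devices:
            if address in window:
                # The handler sees an offset within its own window.
                handler(address - window.start, value)
                return "device"
        return "unmapped"
```

Because the processor only issues ordinary loads and stores, nothing in
this scheme depends on processor-specific I/O instructions.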

For the building-block computers to find wide application
they should be consistent with military standardization programs. Thus,
external bus interface circuits in the building block architecture use
MIL-STD 1553A.

1.3 DESIGN APPROACH 

After a study of alternative approaches to the design of
building-block-implemented, fault-tolerant computing systems, the
following architecture was selected. The building-block circuits
being developed are used to assemble commercially available microprocessors
and memories into Self-Checking Computer Modules (SCCM), as
shown in Figure 1-1. Each SCCM is a small computer with the unusual
property that its hardware is capable of detecting a wide variety of
internal faults concurrent with normal (user) program execution. It
can be connected (through a redundant external busing system), together
with other SCCMs, into a redundant network in which backup SCCMs are
provided to take over for a computer (SCCM) which has failed.

As shown in Figure 1-1, three of the building blocks interface
(1) local memory, (2) the external busing system, and (3) local I/O
to the processor. These interface building blocks are responsible for
detecting faults in the circuits that they interface to the SCCM's
processor, and faults in their own internal logic. They send fault
indicator signals to the Core Building Block (Core-BB) if such a fault
is detected.

The Core Building Block compares the outputs of two CPUs
performing identical computations to detect (but not isolate) CPU
faults, and it receives the fault signals from the other building
blocks. It also checks error-detecting codes which are used to detect
errors on the internal buses of the SCCM. The Core is responsible for
disabling the SCCM upon detecting a fault anywhere within it. (An
optional program rollback may be attempted to recover from some transient
faults locally.)
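
The Core's role as described above (compare two CPUs, collect fault
signals, disable the module) can be sketched in a few lines. This is a
behavioral illustration only, not the report's logic design; the class
name CoreChecker, the per-cycle step interface, and the fault-line names
are assumptions made for the example, and rollback is not modeled.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class CoreChecker:
    """Behavioral stand-in for the Core's compare-and-disable function."""

    enabled: bool = True
    fault_log: List[str] = field(default_factory=list)

    def step(self, cpu_a_out: int, cpu_b_out: int,
             fault_lines: Iterable[Tuple[str, bool]] = ()) -> bool:
        """Check one bus cycle; return True if the module stays enabled."""
        if not self.enabled:
            return False
        # A disagreement detects a CPU fault but cannot say which CPU failed.
        if cpu_a_out != cpu_b_out:
            self.fault_log.append("cpu-disagreement")
        # Fault indicators raised by the other building blocks.
        for name, raised in fault_lines:
            if raised:
                self.fault_log.append(name)
        if self.fault_log:
            self.enabled = False  # the module takes itself out of service
        return self.enabled
```

Once disabled, the checker stays disabled, which mirrors the idea that a
faulty module removes itself and leaves recovery to a backup.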

Although the primary means of fault recovery is to use backup
SCCMs to replace a SCCM which has failed, it is possible to correct some
of the most likely faults in a failed SCCM (by an internal reconfiguration)
and reuse it. A SCCM can be reconfigured to recover from at
least two local memory faults through use of two spare bit planes.
Redundant external Bus Interface Building Blocks (BIBB) allow communication
through alternate buses if a bus interface should fail, and redundant
I/O Building Blocks can be used within a SCCM. (A design augmentation
currently under consideration allows one of the two CPUs to be
discarded when a disagreement occurs, and computation to continue with
only one. This is for non-critical applications, since CPU fault detection
is no longer available with only one machine.)


Figure 1-1. The Self-Checking Computer Module (SCCM)


1.4 THE BUILDING-BLOCK CIRCUITS


The building-block circuits are briefly described in the
following paragraphs.

1.4.1 The Memory-Interface Building Block (MIBB)

This circuit interfaces a set of commercial memory chips to
the local bus within a SCCM. It is capable of detecting single faults
within the memory, effecting replacement of up to two faulty bit planes
with spares, and correcting single bit errors using a (SEC/DED) Hamming
code. It generates and checks parity codes to protect information
transfer on the SCCM internal bus. Special checking circuits are employed
in the MIBB to detect faults in the memory and within the MIBB, and
fault signals are sent to the Core.
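
The single-error-correcting, double-error-detecting (SEC/DED) behavior
can be illustrated with a small sketch. It uses a 4-bit data word and
the textbook extended Hamming (8,4) code, which is an assumption made to
keep the example short; the report's own table lists an odd-weighted
SEC/DED code, and the word width used by the MIBB is not reproduced
here.

```python
from typing import List, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8  # index 0: overall parity; indices 1..7: Hamming(7,4)
    word[3], word[5], word[6], word[7] = d
    word[1] = word[3] ^ word[5] ^ word[7]  # covers positions 1, 3, 5, 7
    word[2] = word[3] ^ word[6] ^ word[7]  # covers positions 2, 3, 6, 7
    word[4] = word[5] ^ word[6] ^ word[7]  # covers positions 4, 5, 6, 7
    word[0] = sum(word[1:]) % 2            # makes total parity even
    return word


def decode(word: List[int]) -> Tuple[str, int]:
    """Return (status, nibble); status is 'ok', 'corrected' or 'double-error'."""
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos            # XOR of the positions of all set bits
    overall = sum(word) % 2            # 0 when total parity is still even
    fixed = list(word)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd total parity: exactly one bit flipped. The syndrome names it
        # (syndrome 0 means the overall parity bit itself).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity with a nonzero syndrome: two bits flipped.
        return "double-error", -1
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble
```

A single flipped bit anywhere in the codeword is corrected, while any two
flipped bits are reported rather than miscorrected; in the architecture
described here such a report would be passed to the Core as a fault
signal.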

1.4.2 The Core Building Block (Core-BB)

This circuit provides a continuous comparison between two
processors that run synchronously to detect processor faults. It also
includes parity generation and checking circuits to interface the processor
with the SCCM local bus and to detect faults on that bus. Internal
bus allocation (arbitration) is provided between the CPU and competing
DMA channels in the other building blocks. Also, the Core is responsible
for disabling its host SCCM in the presence of faults and, optionally,
attempting rollback/restart procedures. The Core, like all other
building blocks, contains internal checking circuitry to detect faults
within its own internal logic.

1.4.3 The Bus-Interface Building Block (BIBB) 

This circuit can be microprogrammed to perform the functions 
of either a controller or terminal (adaptor) to an external 1553A bus. 
Several BIBBs can be used within an SCCM to provide communications over
several redundant external buses.

The BIBB provides the hardware interface between an external
bus and the internal bus of its host SCCM. Internal fault-detecting
circuitry is provided within the BIBB, and the parity and status
messages employed in 1553A are used to verify proper message transmission
and reception.

1.4.4 I/O Building Block (IOBB)

A discussion is included later in this report on the various
circuits required to provide fault-detection and redundancy in the
interfaces between an SCCM and its associated peripheral devices.

1.5 SCCM PROPERTIES 

A "typical" SCCM would consist of the following integrated
circuits: 32 commercial RAM chips, 2 commercial microprocessors,
1 MIBB, 1 Core, 3 BIBBs, two IOBBs, and several additional circuits.
A previous report has indicated that its characteristics would
approximate those listed below if the building blocks were implemented
as VLSI devices [RENN 78a].

Power     8 W
Weight    1.4 lb*
Volume    23 cu in.*
Cost      $13,600*

*Not including power supply.

The cost represents high-reliability production (e.g.,
MIL-STD 883B) and could be greatly reduced in large quantities. Figure
1-2 is an estimate of the reliability of a single SCCM, a SCCM
backed up by a standby spare, and, for comparison purposes, a non-redundant
computer made with similar technology. A simple combinational
model was used (see RENN 78a) and it was assumed that a 10,000-gate
VLSI device has a failure rate of one failure per million hours. An
SCCM costs approximately 50% more in power, weight, volume, and dollars
than an equivalent non-redundant machine; but since it can tolerate
internal memory faults, its inherent reliability is 2-3 times greater
(over the period being modeled). A pair of SCCMs can provide fully
fault-tolerant operation with very much improved reliability.


Figure 1-2. Reliability Improvement Using SCCMs

1.6 THE DISTRIBUTED COMPUTER (SCCM) ARCHITECTURE

An architecture has been selected for implementing fault-tolerant
distributed computing networks made up of SCCMs. The selected
architecture consists of a number of computers (SCCMs) performing 
separate tasks, and which are connected by a redundant multiple bus 
structure, as shown in Figure 1-3. 

There are two classes of SCCMs used within this network, 
designated Terminal Modules and High-Level Modules. Each Terminal
Module is embedded within a particular subsystem and performs local 
control and data gathering tasks. The High-Level computer modules con- 
trol the functioning of various terminal modules by controlling an 
intercommunications bus. Using the bus, a High-Level SCCM can move data 
directly into or out of memories of other computers and thus broadcast 
commands or gather data for its various processing functions. 

In this configuration, several techniques are employed to 
achieve fault tolerance. First, all of the computers are self-checking 
(SCCMs) and are designed to detect their own internal faults. 



Figure 1-3. Distributed Standby Redundant Architecture
[Diagram labels: High-Level Computer Modules (HLM); spare computer modules; Terminal Modules (TM); BIBBs, Core-BB, IOBB; I/O interface to subsystem]


Secondly, backup spares are employed to replace faulty
computer modules. In the case of High-Level Modules, spares are non-dedicated.
A faulty module disables its own bus control function.
Spare modules are programmed to detect the resulting lack of activity
and take over the ongoing computations. For a Terminal Module, a failure
is indicated through the bus system (by polling), and a High-Level Module
effects its replacement by activating a dedicated backup spare module.

Thirdly, a highly redundant bus system is employed so that a
faulty bus may be replaced by a spare. In the case of single faulty
terminals, individual messages may be rerouted over different buses.
Automatic status messages are employed in the bus format to verify
proper transmission and reception of messages.
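
The takeover rule for non-dedicated High-Level spares (a spare notices
that bus control activity has stopped and assumes the task) can be
sketched as a simple silence counter. The class name SpareMonitor, the
tick-based timing, and the timeout parameter are assumptions made for
illustration; the report does not specify the detection interval here.

```python
from dataclasses import dataclass


@dataclass
class SpareMonitor:
    """Spare High-Level Module that watches for bus-controller activity."""

    timeout: int            # hypothetical: silent ticks tolerated before takeover
    silent_ticks: int = 0
    active: bool = False

    def tick(self, bus_activity_seen: bool) -> bool:
        """Advance one time step; return True once the spare has taken over."""
        if self.active:
            return True
        if bus_activity_seen:
            self.silent_ticks = 0          # controller is alive, keep waiting
        else:
            self.silent_ticks += 1
            if self.silent_ticks >= self.timeout:
                self.active = True         # assume the failed module's task
        return self.active
```

The scheme relies on the property stated above that a faulty module
disables its own bus control function, so silence on the bus is a
dependable sign of failure.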

A more detailed description of this architecture can be
found in RENN 78b.














1.7 SUMMARY


This phase of the building-block Fault-Tolerant Computing
Study has two intended results. The first is the design of three
building-block circuits: (1) the MIBB, (2) the Core, and (3) the BIBB.
The second is the verification of the building-block designs by constructing
a breadboard consisting of two SCCMs employed as high-level
modules. This can be done by injecting simulated faults into one SCCM
and verifying that the fault is detected and that the other SCCM recovers
correct computations.

This report describes the design of the building-block
circuits. The designs presented herein have been used for the initial
breadboard layout, and will be modified as debugging progresses.

1.8 REPORT OUTLINE 

The following two sections (2 and 3) provide background 
material on the methodology of fault tolerance, and the specific assump- 
tions on technology and application requirements which led to the 
selection of the building-block SCCM architecture described in this
report. The reader who is interested primarily in design details can 
skip to Section 4, which provides more detailed descriptions of the 
individual building-block circuits. 


SECTION 2

THE CONCEPTS OF FAULT-TOLERANT COMPUTING

The purpose of this section is twofold:

(1) to provide the overall context of fault-tolerant
computing as a discipline of computer science and
engineering within which the specific results of this
study are to be interpreted; and

(2) to supply a self-contained complete introduction to
fault-tolerant computer systems for readers who have
not encountered this aspect of computer system design
in the past.

A fault is an abnormal condition that appears during the 
operation of an information processing system. Its manifestation may 
cause a departure from the expected behavior and force the system into 
an undesirable (error) state or sequence of states. The arrival at an 
error state, in turn, leads to a partial or complete failure of the
system to execute the specified function, unless provisions exist to 
cause a return to the expected behavior. Causes of faults are either 
adverse natural phenomena or human mistakes. Because of their disrup- 
tive effect on system operation, the avoidance and/or tolerance of 
faults are major problem areas in contemporary information-processing 
activities, including the design, analysis, management, and use of
information systems.

The word "fault" in the subsequent discussion means "an
abnormal condition of hardware, programs, or data that may cause a
deviation of the information-processing behavior of some part of the
given system from the expected sequence," and "system" comprises all
hardware elements, programs and microprograms, input signals, stored
information, inter-system communication, and man-machine interaction
functions. All these parts of the system have to be considered because
in practice they all are affected by faults. As a consequence, the 
fault problem transcends the traditional "hardware-software" applica- 
tions boundaries and becomes a global problem of information processing. 


The word "expected" is preferred to the word "correct" in
the description of fault-free behavior because the question of correct
behavior, as it has been specified by the originator or user of the
system, exceeds the scope of fault-tolerance considerations. For example,
the choice of an unsuitable algorithm by the user will lead to expected
behavior that is not correct with respect to the user's ultimate goal.

The various types of faults that are encountered during system
operation fall into two fundamentally distinct classes: physical
faults and man-made faults. Physical faults are faults caused by
adverse natural phenomena, such as failures of hardware components, and
physical interference originating in the environment. Man-made faults
are faults that result from human mistakes, including less than perfect
specification, design, production (assembly), and man/machine
interaction.

Fault-tolerance is a property of the entire system that
allows it to continue the expected behavior regardless of the appearance
of certain (explicitly specified) classes of faults (physical, man-made
or both) that would otherwise force the system into an error state. The
most commonly accepted notion of fault-tolerance refers to physical
faults only. The inclusion of man-made faults is a recent generalization
that offers a major challenge to investigators and designers of
information processing systems.

A complete discussion of fault-tolerance must deal with its
three fundamental aspects:

(1) The pathology of faults, including study of their
causes, classification according to their immediate
manifestations, and characterization according to the
symptoms (errors) observable in system behavior.

(2) The implementation of tolerance, encompassing the
three basic functions of masking, detection, and
recovery.

(3) The modeling, analysis, and evaluation (measurement)
of fault-tolerance by means of mathematical techniques,
simulation, and experimentation with implemented
systems.

The goals of this section are: (a) to present a unified
view of the many aspects of fault-tolerance; (b) to identify some
obstacles that remain to be overcome; and (c) to discuss the prospects
for future advances in this field. Fault-tolerance with respect to
both physical and man-made faults is considered, with emphasis on the
more developed field of tolerating physical faults. The current state-of-the-art
in the design and application of fault-tolerant systems is
illustrated by examples of existing systems and innovative proposals.

The viewpoint presented here is that the purpose of fault-tolerance
is to provide the means for the idealized (fault-free)
abstract logical structure of a computing system to function successfully
while embodied in its fault-susceptible implementation. Consequently,
fault-tolerance attains full significance only when it is
incorporated and utilized as an integral function of an information
processing system. Outside of this system context, it remains, at best,
a potentially applicable exercise for a researcher, and at worst, a
tool to support naive or irresponsible promises of near-perfect
operation.

2.1 APPROACHES TO THE FAULT PROBLEM 

While conceptually the digital computer is a logical system 
for the storage and manipulation of symbols, in practice it is imple- 
mented using physical components and exists in an environment in which 
it is affected by various natural phenomena. Some phenomena, such as 
physical changes in the components and adverse effects of the environ- 
ment, disrupt the operation as it is specified by the designers and 
programmers and lead to deviations from the expected behavior. These 
deviations have variously been called failures, faults, errors,
intermittents, glitches, crashes, etc. They occur because we attempt to
carry out abstract symbol manipulation operations in a physical world
which offers less than perfect components and less than completely
benign environments.

The problems of avoiding these phenomena, and of recovering
from their effects after they have occurred, have been of interest to the
entire community of computer theorists, designers, builders, analysts,
and users ever since the first calculating devices were devised. The
first pioneers who attempted to implement their ideas were simply
overwhelmed by the adversity of the physical world, such as in the case
of Babbage's Calculating Engine.

The invention and refinement of electromagnetic relays,
vacuum tubes, delay-line and cathode-ray tube storage, paper tape, and
punched cards finally made machine computing feasible in the 1940's.
However, the history of the early days of machine computing is filled
with accounts of the continuing struggle against the imperfections of
components and hostility of environments. Ingenious defenses against
faults, such as duplicate units, error-detecting codes, etc., are found
in most early digital computers [IRE 53], [EJCC 53].

The advent of the transistor and the magnetic-core storage
element in the 1950's brought about a major increase in component
reliability and at least temporarily relegated the concern with system
reliability into the hands of component experts, and away from the main
concerns of system designers and users.

The problem of reliability reappeared as a major issue
in the early 1960's when the applications of computers expanded
into the areas of space exploration, real-time system control, and
especially manned space-flight, in which the lives of the crew literally
depended on successful computer operation.

The reliability of components has continued to improve
since that time. However, the expanding range of applications and the
growing complexity of systems have kept the reliability problem in the
foreground and have led to the evolution of the concept of fault-tolerant
computing, which is the designer's and the programmer's method to
provide reliable computer operation while using less than perfect
components in less than ideal environments [AVIZ 75a]. The major part of
this section considers the tolerance of physical faults; the issue of
man-made faults is addressed in Section 2.3.

2.1.1 Tolerance and Avoidance: Complementary Approaches to the
Fault Problem

A look at computers of the present and of the immediate past
shows that many systems have either very few fault-tolerance features,
or none at all. In these cases, reliability with respect to physical
faults is sought by means of the fault-avoidance approach (also called
"fault-intolerance" in some papers) in which the reliability of
computing is assured by a priori elimination of the causes of faults. The
elimination takes place before regular use begins, and the resources
that are allocated to attain reliability are spent on perfecting the
system prior to its field use. Redundancy is not employed, and all
parts of the system must function correctly at all times. Since in
practice it has not been possible to assure the complete a priori
elimination of all causes of faults, the goal of fault-avoidance is to
reduce the unreliability (expressed as the probability of system failure
before the end of a specified time interval) of the system to an
acceptably low value. To supplement this approach, manual maintenance
procedures are devised which return the system to an operating condition
after a failure. The cost of providing maintenance personnel and the
cost of the disruption and delay of computing also are parts of the
overall cost of using the fault-avoidance approach. The procedures
which have led to the attainment of reliable systems using this approach
are:

(1) Acquisition of the most reliable components and their
testing under various conditions within the given cost
and performance constraints.

(2) Use of thoroughly refined techniques for the
interconnection of components and assembly of subsystems.

(3) Packaging and shielding of the hardware to screen out
expected forms of external interference.

(4) Carrying out of comprehensive testing of the complete
system prior to its use.

Once the design has been completed, a quantitative
prediction of system reliability is made using known or predicted failure
rates for the components and interconnections. In a "purely"
fault-avoiding (i.e., nonredundant) design, the probability of fault-free
hardware operation is equated to the probability of correct program
execution. Such a design is characterized by the decision to invest all
the reliability resources into high-reliability components and
refinement of assembly, packaging, and testing techniques. Occasional
system failures are accepted as a necessary evil, and manual maintenance
is provided for their correction. To facilitate maintenance, some built-in
error detection, diagnosis, and retry techniques are provided. This is
the most common current practice in computer system design; the trend is
toward an increasing number of built-in aids for the maintenance
engineer.
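
The quantitative prediction described above can be illustrated with a
minimal sketch. It assumes, beyond what the text states, that component
failures are independent and have constant failure rates, so that a
nonredundant system works only while every component works. The function
name, the failure rates, and the mission time are illustrative
assumptions, not values from this report.

```python
import math

def series_reliability(failure_rates_per_hour, mission_hours):
    """Reliability of a nonredundant (series) system.

    Assumes independent components with constant failure rates, so
    R(t) = exp(-(sum of rates) * t). Every component must work.
    """
    total_rate = sum(failure_rates_per_hour)
    return math.exp(-total_rate * mission_hours)

# Hypothetical failure rates (failures per hour) for three parts.
rates = [2e-6, 5e-6, 3e-6]
reliability = series_reliability(rates, mission_hours=10_000)
unreliability = 1.0 - reliability
print(f"R = {reliability:.4f}, unreliability = {unreliability:.4f}")
```

Under these assumptions the unreliability grows with both the number of
components and the mission time, which is why a purely fault-avoiding
design must spend its resources on lowering the individual rates.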

The traditional fault-avoidance approach of diagnosis-aided
manual repair, however, has proved to be an insufficient solution in
many cases because of at least three reasons: the unacceptability of
the delays and interruptions of real-time programs (air traffic control,
process control, etc.) caused by manual repair action; the
inaccessibility of some systems (space, undersea, etc.) to manual repair;
and the unacceptably high cost of lost time due to manual maintenance in
many installations. The direct dependence of human lives on some
computer-controlled operations (air traffic control, manned spaceflight,
etc.) has added a psychological reason to object to the fault-avoidance
approach: although only one system in a million is expected to fail in
a given time interval, all users of the entire million systems are
subject to the anticipation that they may be involved in this failure.

An alternate approach which alleviates most of the above
shortcomings of the traditional fault-avoidance approach is offered by
fault-tolerance. In this approach the reliability of computing is
assured by the use of protective redundancy. Faults are expected to be
present and to cause errors during the computing process, but their
effects are automatically counteracted by the redundancy. Reliable
computing is made possible despite certain classes of hardware failures,
external interference with computer operation, and perhaps even some
man-made faults in hardware and software. Part of the resources allocated
to attain reliability are spent on protective redundancy. The redundant
parts of the system (both hardware and software) either take part in the
computing process or are present in a standby condition, ready to act
automatically to preserve its undisrupted continuation. This contrasts
with the manual maintenance procedures which are invoked after the
computing process has been disrupted, and the system remains "down" for
the duration of the maintenance period.

It is evident that the two approaches are complementary and
that the resources allocated to attain the required reliability of
computing may be divided between fault-tolerance and fault-avoidance.
Experience and analysis both indicate that a balanced allocation of
resources between the two approaches is most likely to yield the highest
reliability of computing. Fault-tolerance does not entirely eliminate
the need for reliable components; instead, it offers the option to
allocate part of the reliability resources to the inclusion of
redundancy. One reason for the use of a fault-tolerant design is to achieve
a reliability or availability prediction that cannot be attained by the
purely fault-avoiding design. A second reason may be the attainment of
a reliability (or availability) prediction that matches the purely
fault-avoiding design at a lower overall implementation cost. A third
reason is the psychological support to the users who know that
provisions have been made to handle faults automatically as a regular
part of the computing process. The fault-avoidance approach clearly was
the dominant choice in the 1950's and 1960's. In recent years, the
fault-tolerance approach has been making significant inroads with respect
to physical faults. Its application with respect to man-made faults has
remained very limited.

2.1.2 Classes of Physical Faults

Physical faults are caused by three classes of phenomena
that affect the hardware of the system during execution of programs.
They are permanent failures of hardware components, temporary
malfunctions of components, and external interference with system
operation. There are three useful dimensions for the classification of
physical faults:

(1) Duration: transient vs. permanent 

(2) Extent: local vs. distributed

(3) Value: determinate vs. indeterminate 

Transient faults are faults of limited duration, caused
either by temporary malfunctions of components or by external
interference. The characterization of a transient fault must include a
"maximum duration" parameter; faults that last longer will be
interpreted as permanent by recovery algorithms. Other characteristics are
the arrival model and the duration of transients [AVIZ 75a]. Permanent
faults are caused by irreversible failures of components. They are
characterized by the failure rate parameter; often two or more failure
rates are used for the same components under different conditions, such
as power-on and power-off states. The following classifications
according to extent and according to value are applicable to both
transient and permanent faults.
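
The three dimensions listed above, together with the "maximum duration"
rule for transients, can be captured in a small data structure. This is
only a sketch: the type names, the function names, and the numeric
threshold are assumptions made for illustration and do not come from the
report.

```python
from dataclasses import dataclass
from enum import Enum

class Duration(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"

class Extent(Enum):
    LOCAL = "local"
    DISTRIBUTED = "distributed"

class Value(Enum):
    DETERMINATE = "determinate"
    INDETERMINATE = "indeterminate"

@dataclass(frozen=True)
class FaultClass:
    duration: Duration
    extent: Extent
    value: Value

def classify_duration(observed_seconds, max_transient_seconds):
    """Treat a fault as permanent once it outlasts the assumed limit."""
    if observed_seconds > max_transient_seconds:
        return Duration.PERMANENT
    return Duration.TRANSIENT

def classify_extent(affected_variables):
    """Local faults touch one logic variable; distributed faults touch more."""
    return Extent.LOCAL if affected_variables == 1 else Extent.DISTRIBUTED

# Hypothetical observation: a 2 ms disturbance on a single signal.
fault = FaultClass(
    duration=classify_duration(observed_seconds=0.002, max_transient_seconds=0.010),
    extent=classify_extent(affected_variables=1),
    value=Value.DETERMINATE,
)
print(fault)
```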

The extent of a fault describes how many logic variables in 
the hardware are simultaneously affected by the fault which is due to 
one failure phenomenon. Local (single) faults are those that affect 
only single logic variables, while distributed (related multiple) faults 
are those that affect two or more variables, one module, or an entire 
system. The physical proximity of logic elements in contemporary MSI 
and LSI circuitry has made distributed faults much more likely than in 
the discrete component designs of the past. Distributed faults are 
also caused by external interference and by single failures of some 
critical elements in a computer system, i.e., clocks, power supplies,
switches used for reconfiguration, etc. 

The value of a fault is determinate when the logic values
affected by the fault assume a constant value ("stuck on 0" or "stuck
on 1") throughout its entire duration. The fault is indeterminate when
it varies between "0" and "1" throughout the duration of the fault,
but not in accord with design specifications. The determinacy of a
fault depends on the failure mechanism. For example, drift of component
values or "shorting together" of two signals are likely to cause
indeterminate faults.

It is important to note that the description of fault extent
and fault value applies at the origin of the fault; that is, at the point
at which the failure phenomenon has actually taken place. The
fault-caused introduction of one or more incorrect logic values into the
computing process often leads to more extensive fault symptoms farther
away (in space and/or in time) from the point of failure. At other times,
the presence of an incorrect logic value is masked by other (correct)
logic variables and no symptoms at all appear at more remote points.
Confusion and ambiguity are avoided when the term "fault" is restricted
to the change in logic variable(s) at the point of the physical hardware
failure. The fault-caused changes of logic variables which are observed
farther away on the outputs of correctly functioning logic elements will
be called "errors." This choice of terms describes the following
cause-effect sequence:

(1) The failure, which is a physical phenomenon, causes a
fault, which is a change of logic variable(s) at the
point of failure.

(2) The fault supplies incorrect input(s) to the computing
process and may cause an error to be produced by
subsequent operations of failure-free logic circuits.

The number of points that can be observed for the purpose of
fault detection is limited because integrated circuits are internally
complex, and have relatively few outputs. Digital-logic simulation
programs which analyze the behavior of faulty logic circuits and predict
the errors that will appear on the outputs (for a given class of faults)
are essential tools for the generation of fault-detection tests
[SZYG 76]. An illustration of a simulation and analysis program to
analyze the behavior of faulty circuits is the Logic Analyzer for
Maintenance Planning (LAMP) system [CHAN 74]. In addition, LAMP also
performs logic design verification, generates fault-detection tests,
evaluates diagnostics, and produces data for trouble-location manuals.
LAMP exemplifies the current trend toward multipurpose simulation systems
in digital system design.
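
The idea of simulating a faulty circuit to predict output errors can be
shown on a toy example. The sketch below is not related to LAMP or to any
real simulator; the three-input circuit, the function names, and the
choice of a single internal "stuck" node are assumptions made purely for
illustration.

```python
from itertools import product

def circuit(a, b, c, stuck=None):
    """Evaluate out = (a AND b) OR c.

    `stuck` optionally forces the internal node n = a AND b to 0 or 1,
    modelling a determinate ("stuck") fault at that point.
    """
    n = a & b
    if stuck is not None:
        n = stuck
    return n | c

def detecting_tests(stuck):
    """Input vectors whose output differs from the fault-free output."""
    return [
        (a, b, c)
        for a, b, c in product((0, 1), repeat=3)
        if circuit(a, b, c) != circuit(a, b, c, stuck=stuck)
    ]

print("stuck-at-0 detected by:", detecting_tests(0))
print("stuck-at-1 detected by:", detecting_tests(1))
```

Whenever c = 1 the OR gate hides the internal node, so the fault is
present but produces no error at the output; only vectors with c = 0 can
serve as fault-detection tests for this node.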

2.2 TOLERANCE OF PHYSICAL FAULTS

Fault-tolerance functions in computer systems are not
necessary (redundant) as long as faults do not occur, and they can be
deleted from a perfectly fault-free system without affecting its
performance. In fault-susceptible systems they are implemented by the
means of protective redundancy, which becomes effective when faults
occur.

The implementation of fault-tolerance may be discussed from
two viewpoints: according to the functions being performed, and
according to the forms of redundancy that are used to provide these
functions. From the functional viewpoint we distinguish three classes of
fault-tolerance functions: masking, detection, and recovery. Each class
contains several distinct approaches to implementation which will be
discussed in this section. The other viewpoint distinguishes different
forms of protective redundancy. The redundancy techniques have been
developed in three different forms: hardware (additional components),
software (special programs), and time (repetition of operations).

In this discussion, the functional classification is
considered to be most suitable for the exposition of implementation
techniques. Each function is discussed separately, outlining the
redundancy techniques that are available for its implementation.

2.2.1 Fault Masking

The masking function employs redundancy to assure that the
effect of a fault is completely contained within a system module. As
long as the redundancy is not exhausted, the fault is concealed within
the module and no symptoms whatsoever appear on its outputs. When the
redundancy is exhausted or overwhelmed by a fault, module failure
results. Separate detection and recovery functions are not identifiable
when the module is viewed from outside. Because of this, masking has
been called a static redundancy technique [SHOR 68] and has been used in
the design of various structures, e.g., airplane frames, bridges, etc.,
prior to the appearance of digital systems. Masking is also thought to
be the form of fault-tolerance used by the nervous systems of living
organisms [VONN 56].

A key question in masking is the choice of the size of the
module within which the masking occurs. The smallest module is a set of
individual hardware components (e.g., diodes, relay contacts,
connections, etc.). On the other extreme, a module may be as large as an
entire computing system, in which case the module terminals are the
output devices. Theoretical analyses of masking usually do not specify the
module size; it depends on the feasibility of implementation.

In digital systems, masking is usually accomplished by
hardware redundancy, i.e., by the replication of hardware elements. The
fundamental theoretical analysis of masking is due to von Neumann
[VONN 56], and Moore and Shannon [MOOR 56]. Its early appearance can be
attributed to the previous use of masking in other disciplines of
engineering. The techniques of introducing hardware redundancy have been
classified into two categories: static and dynamic [SHOR 68]. The
static method implements the masking function, since the redundant
components contain the effect of hardware failures within a given
hardware module, and the outputs of the module remain unaffected as long
as the redundancy is effective. The static technique is applicable
against both transient and permanent faults. The redundant replicas of an
element are permanently connected and powered; therefore, they provide
fault masking instantaneously and automatically. However, if the
redundancy is exhausted, or if the fault is not susceptible to masking and
causes an error, a delayed recovery is not provided. In practice, we
find that two forms of static redundancy have been applied in U.S. space
program computers: replication of individual electronic components, and
triple modular redundancy (TMR) with voting [COOP 76]. Several other
forms have been studied but were not applied either because of their
excessive cost or because they required practically unrealizable special
components [SHOR 68].
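
Triple modular redundancy with voting can be sketched in a few lines.
The voter below takes a bitwise two-out-of-three majority, and the
reliability function uses the standard expression for "at least two of
three replicas work". Both rest on assumptions not made in the text
above: a perfect voter and statistically independent replica failures.
The function names and the sample values are illustrative.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three replica outputs (integers)."""
    return (a & b) | (a & c) | (b & c)

def tmr_reliability(r):
    """Probability that at least 2 of 3 independent replicas work.

    Assumes a perfect voter and independent replica failures.
    """
    return 3 * r**2 - 2 * r**3

correct = 0b1011_0010
faulty = correct ^ 0b0000_1000   # one replica has a flipped bit
assert majority_vote(correct, faulty, correct) == correct
print(f"replica R = 0.95 -> TMR R = {tmr_reliability(0.95):.5f}")
```

Note that for replica reliability below 0.5 the same expression gives a
result lower than that of a single replica, which is one way to see that
masking postpones failure rather than removing it.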


The use of static hardware redundancy is based on the
assumption that failures of the redundant replicas are independent. For
this reason, use of static redundancy is difficult to justify within
integrated circuit packages, in which many failure phenomena are likely to
affect several adjacent components. Other disadvantages include the
cost of massive replication (3, 4 or more times the number of original
system elements), the need to assume independent failures of the
replicas, and the absence of a warning when a redundant module finally
fails. Thus, masking is close to fault avoidance: while it may postpone
the time of failure, the module still fails suddenly and irrecoverably
when its internal redundancy is exhausted.

Regardless of these shortcomings, masking still may find
application because of its conceptual simplicity and its instant action,
entirely transparent to the user. A promising area of application is in
protecting a small "hard core" of a system for which other approaches
are extremely costly or altogether impractical. Another area is the
application in non-electrical, discrete-component technologies, such as
fluidic logic for high-temperature or extreme radiation environments.

2.2.2 Fault Detection 

The detection function is the starting point of all
fault-tolerance implementations except for those that depend exclusively
on masking. The most sophisticated recovery methods are only as good as
the fault detection scheme which initiates their operation. For the
purpose of this discussion we say that fault detection has taken place
at the time instant at which a fault signal becomes available to be used
by a recovery algorithm. All subsequent fault-location actions are
considered to be part of the recovery algorithm. The existence of a false
fault signal is also possible. This is a false alarm that is due to a
malfunction of the fault detection scheme itself.

Fault detection is implemented by means of all the hardware,
software and repetition (time) methods that generate the initial fault
signal. All these methods may be conveniently grouped according to the
time of their application with respect to the normal operation of the
system as follows:

(1) Initial testing, which takes place prior to normal use
and serves to identify faulty hardware elements
containing imperfections introduced during the
manufacturing or assembly processes.

(2) Concurrent (on-line) detection, which takes place
simultaneously with normal operation of the system.

(3) Scheduled (off-line) detection, which takes place when
normal operation is temporarily interrupted.

(4) Redundancy testing, which serves to verify that the
various forms of protective redundancy are themselves
fault-free, and takes place either concurrently or at
scheduled intervals.

Initial testing follows the production of individual
circuits and serves to eliminate the circuits that contain manufacturing
defects [BREU 76]. Computer programs for test generation have become
an essential tool to facilitate initial testing [SZYG 76], [CHAN 74].
The great internal complexity and a relatively small number of
input/output points in contemporary LSI circuits (e.g., microprocessors,
memories, etc.) have made exhaustive logic-level testing, in many cases,
economically unfeasible. Recent research has emphasized probabilistic
approaches [PARK 76] and combined logic and functional testing [MCPH 76].
Initial testing represents a significant part of the total cost of
digital circuits and is likely to remain a high-priority research problem
for the foreseeable future.

Concurrent (on-line) fault detection during system operation
is implemented by means of special hardware or software that operates
concurrently with the regular programs of the system. An important
advantage of concurrent detection is that recovery can be initiated
before fault-caused errors can cause extensive disruption of programs or
damage to the data. Hardware methods for concurrent detection have been
used since the first generation of computers. They include
error-detecting codes (parity, etc.) [AVIZ 71a], [DOWN 64], duplication
and comparison [DOWN 64], disagreement detectors with majority voters
[ANDE 67], special circuits to monitor certain critical elements (clocks,
power supplies, memory write operation circuits, etc.) [DOWN 64],
machine status and completion signals [AVIZ 71a], self-checking logic
circuits [CART 74], and checksumming, timers, and built-in test equipment
of various types.
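
As a minimal illustration of an error-detecting code, the sketch below
uses a single even-parity bit. It is written in software only for
readability; in the systems discussed here the check would be done by
hardware. The function names and the sample word are assumptions.

```python
def parity_bit(word):
    """Even-parity bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") % 2

def encode(word):
    """Append the parity bit as the least significant bit."""
    return (word << 1) | parity_bit(word)

def check(codeword):
    """Return True if the codeword has even overall parity."""
    return bin(codeword).count("1") % 2 == 0

stored = encode(0b1011001)
assert check(stored)
corrupted = stored ^ (1 << 3)      # single-bit fault
assert not check(corrupted)
double = stored ^ 0b101            # two flipped bits go undetected
assert check(double)
```

The last assertion shows the known limit of a single parity bit: any
even number of flipped bits leaves the parity unchanged and is missed.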

Software methods for concurrent detection either employ the
concurrent execution of two (or more) programs, or they consist of
special features interwoven with the single program being executed. In
the case of two or more identical programs using separate processors
and/or multiple storage in separate memories, a comparison is
accomplished by a programmed exchange of results [WENS 76] or checksums
[SKLA 76], rather than by hardware comparators. An alternative is to
use a dedicated subsystem (e.g., a "maintenance" minicomputer) which
executes monitoring programs to observe the operation of the remaining
parts of the system. Fault detection features that can be interwoven
with a single program include the use of passwords, acknowledgments
("handshakes"), checksumming, reasonableness checks on results,
programmed "watchdog" timers, etc. Compared to hardware methods,
fault-detection by software is less prompt and more susceptible to
disruption by the fault itself. It is used very widely because it can be
superimposed relatively easily on an already existing hardware system.
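
Two of the interwoven software checks named above, checksumming and
reasonableness checks, can be sketched as follows. The additive
checksum, the altitude range, and all names are hypothetical choices
made for illustration; a real system would pick checks suited to its own
data.

```python
def checksum(block):
    """Simple additive checksum over a sequence of byte values."""
    return sum(block) % 256

def send(block):
    """Attach the checksum to the data before it is handed over."""
    return list(block), checksum(block)

def receive(block, expected):
    """Return True if the received block matches its checksum."""
    return checksum(block) == expected

def reasonable(altitude_m, low=0.0, high=50_000.0):
    """Reasonableness check: flag results outside a plausible range."""
    return low <= altitude_m <= high

data, check_value = send([17, 42, 200, 3])
assert receive(data, check_value)
data[1] ^= 0x10                      # simulate a fault-caused error
assert not receive(data, check_value)
assert not reasonable(-125.0)
```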

Scheduled (off-line) fault detection is implemented by means
of software and requires the interruption of current programs in order
to test for the presence of faults. The presence of errors caused by
transient faults can be detected by repeating the execution of the same
program (or a program segment) and comparing the results. The detection
of permanent faults which may have occurred since the last test period
requires the running of diagnostic programs or microprograms [BREU 76],
[DOWN 64], [RAMA 72]. In principle they are quite similar to the
programs for initial testing. The main differences are: time for testing
is usually more strictly limited; testing is executed by the system
itself rather than by another computer; and an interconnected assemblage
of various circuits must be tested, rather than one circuit at a time.
A "bootstrap" approach is very useful, in which a small part of the
system is tested first, and then the tested part is used to run further
tests on other parts, etc. Microdiagnostics have very good resolution
and are especially suitable for this approach [RAMA 72]. Modern systems
also frequently contain special hardware features (e.g., test points)
which facilitate diagnostics [CART 64]. Although the present discussion
deals with use of diagnostics and microdiagnostics for initial fault
detection, we must note that they also often serve to locate detected
faults to within a replaceable or discardable module as part of the
recovery algorithm.
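
The "bootstrap" approach can be sketched as a test scheduler that only
tests a part once everything that part's test relies on has already
passed. The part names, the dependency table, and the lambda test
routines below are hypothetical stand-ins for real diagnostics.

```python
def bootstrap_test(tests):
    """Run tests in dependency order, starting from parts that need nothing.

    `tests` maps a part name to (parts it needs, test function).
    A part is tested only once everything it needs has passed.
    Returns (passed parts, parts that failed or could not be tested).
    """
    passed = set()
    pending = dict(tests)
    progress = True
    while pending and progress:
        progress = False
        for part, (needs, run) in list(pending.items()):
            if set(needs) <= passed:
                del pending[part]
                progress = True
                if run():
                    passed.add(part)
    return passed, set(tests) - passed

# Hypothetical system: each test relies on the previously tested part.
tests = {
    "kernel": ([], lambda: True),
    "alu": (["kernel"], lambda: True),
    "memory": (["alu"], lambda: False),   # simulated permanent fault
    "io": (["memory"], lambda: True),
}
passed, suspect = bootstrap_test(tests)
print("passed:", sorted(passed), "suspect:", sorted(suspect))
```

In this sketch a part whose prerequisites failed is reported as suspect
rather than tested, since a test run on untrusted hardware would not be
meaningful.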

Redundancy testing is a function that is specifically needed
by the fault-tolerance features of a system. Its purpose is to verify
that these features will be ready to use when a fault occurs. An
especially important aspect is to test that various fault signals are
ready to act, i.e., that they are not "stuck" in the "no-fault" state.
Self-checking logic [CART 74] and periodic scheduled tests of fault
signals [CONN 72] are suitable here. A second aspect is the checkout of
redundant parts of the system (e.g., standby spares, copies used for
masking, etc.). While diagnosis programs are suitable for systems with
standby spares [AVIZ 71a], the systems with masking are much more
difficult to check out, especially those in which masking is at the
component level [COOP 76].
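
A periodic test that a fault signal is not "stuck" in the "no-fault"
state can be sketched by feeding the checker one input known to be good
and one known to be bad. The parity checker and the deliberately broken
checker below are illustrative assumptions, not circuits from the cited
designs.

```python
def parity_checker(codeword):
    """Fault signal: True when the codeword has odd (invalid) parity."""
    return bin(codeword).count("1") % 2 == 1

def stuck_checker(codeword):
    """A failed checker whose fault signal is stuck at "no fault"."""
    return False

def checker_is_alive(checker):
    """Exercise the checker with one valid and one invalid codeword."""
    valid, invalid = 0b1100, 0b1101
    return checker(valid) is False and checker(invalid) is True

assert checker_is_alive(parity_checker)
assert not checker_is_alive(stuck_checker)
```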

2.2.3 Recovery 

The recovery algorithm comprises all actions that are ini- 
tiated by the arrival of a fault signal during normal operation and are 
concluded by the resumption of normal operation (possibly in a degraded 
mode), by a systematic shutdown of the system, or by system failure. 

The most fundamental difference between various recovery 
algorithms is whether interaction with a human maintenance operator is 
or is not required as part of the recovery algorithm. Recovery 
algorithms that do not require human decision making are automatic; all
other algorithms are manually controlled, although they may contain
extensive automatic (programmed) sequences. An automatic recovery
algorithm may make use of off-line manual repair which takes place later,
as long as resumption of normal operation does not depend on manual
intervention. Automatic recovery algorithms are further classifiable
(according to the state of the system after recovery has been completed)
into three classes: full recovery, degraded recovery, and safe shutdown.

Full recovery means the return of the system (within allowed
time limits) to a set of conditions that existed before the fault
occurred [AVIZ 71a]. Both the hardware and software possess the same
computing capacity as before. Failed hardware modules are replaced by
spares. Damaged information (programs and data) is returned to a known
good state that existed prior to the fault.

Degraded recovery (often called "graceful degradation," or
"fail-soft operation") returns the system to a fault-free state, but with
a reduced computing capacity [BEUS 69]. This means that some hardware
elements have been discarded without replacement, some programs and/or
data have been lost, or some functions have taken longer than the
allowed time. This approach may be called "partial fault-tolerance,"
since recovery is not 100% successful with respect to the set of
pre-fault conditions. Various "cold start" procedures belong to this
category.

Safe shutdown (also called "fail-safe" operation) is the
limiting case for degraded recovery. It is carried out when the
remaining computing capacity (if any) is below the minimum acceptable
threshold. The goals of shutdown are: to avoid damage to remaining stored
information and good system elements; to cease interaction with other
systems and/or human users in a specified orderly fashion; and to
deliver shutdown messages and diagnostic information to designated
systems, users, or maintenance specialists.
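
The three recovery classes can be summarized as a decision on the
computing capacity that remains after failed units are excluded. The
sketch below is a deliberate simplification: it counts identical units
only and ignores the time limits and information damage mentioned above.
All names and thresholds are assumptions.

```python
from enum import Enum

class Outcome(Enum):
    FULL_RECOVERY = "full recovery"
    DEGRADED_RECOVERY = "degraded recovery"
    SAFE_SHUTDOWN = "safe shutdown"

def choose_recovery(working_units, spares, required_units, minimum_units):
    """Pick the recovery class after failed units have been excluded.

    working_units: fault-free units still in service
    spares: fault-free standby units available as replacements
    required_units: units needed for full computing capacity
    minimum_units: lowest capacity at which operation is acceptable
    """
    available = working_units + spares
    if available >= required_units:
        return Outcome.FULL_RECOVERY
    if available >= minimum_units:
        return Outcome.DEGRADED_RECOVERY
    return Outcome.SAFE_SHUTDOWN

# Hypothetical system needing 4 units, tolerating operation on 2.
print(choose_recovery(working_units=3, spares=1, required_units=4, minimum_units=2))
print(choose_recovery(working_units=2, spares=0, required_units=4, minimum_units=2))
print(choose_recovery(working_units=1, spares=0, required_units=4, minimum_units=2))
```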

Full recovery, degraded recovery, and safe shutdown all
require certain subsidiary functions which follow fault detection. They
are: fault identification and location, error correction in programs
and data, replacement or exclusion of permanently failed elements, and
recording of the observations and actions taken thus far. The final
step is either a restart of normal operations, or the completion of the
shutdown sequence. Both hardware and software techniques have been
devised to implement these functions. They are discussed in more detail
in the following section.


2.3 FAULT-TOLERANT SYSTEMS

The ultimate proof of the effectiveness of fault-tolerance
techniques is found in the performance of existing systems. For the
convenience of discussion, we make the distinction between fully
fault-tolerant (or self-repairing) and manually-controlled systems with
fault-tolerance features. The former complete their recovery actions
without the participation of a maintenance specialist, while the latter
depend on human decision making as part of the recovery sequence. These
decisions may take place at various stages of the sequence, from the
initiation of diagnostics to the operation of the switch which
disconnects a failed part of the system.

The fully fault-tolerant systems may be further classified
according to the availability of external ("off-line") repair. In
closed systems repair is not available, and the system inevitably fails
after the redundancy resources have been exhausted. Closed systems are
usually found in space applications [COOP 76], [AVIZ 71a], [CONN 72].
In repairable systems, failed parts are automatically identified and
excluded from further participation in computing. They are then
replaced by an off-line repair action. System failures usually occur
either because of imperfect fault detection and recovery algorithms,
or because of catastrophic faults (i.e., faults that cannot be handled
by the recovery procedures that were provided). A less frequent cause
of system failure is exhaustion of redundancy, which occurs when faults
occur faster than the repair procedure can handle them. Very prominent
examples of repairable systems are the several models of the ESS
telephone switching systems [DOWN 64], [BEUS 69].



Finally, fault-tolerant systems may be fixed-capacity or
degradable. The former are considered failed if a single specified
capacity cannot be maintained, while the latter are allowed to go to one
or more configurations of lesser capacity before the system is shut
down.


2.3.1 Hardware-Controlled Recovery Systems

Another classification of fault-tolerant systems may be based
on the implementation of the recovery algorithm. Hardware-controlled
systems have dedicated hardware which collects fault indications and
initiates recovery, while software-controlled systems depend on special
programs to interpret fault indications and to carry out the automatic
recovery procedures. The hardware-controlled recovery approach depends
on special hardware to carry out fault detection and to initiate the
recovery procedures. After the existence of a properly functioning
software system has been assured, the completion of recovery is usually
transferred to software control. It is evident that further software
systems may be superimposed on the hardware-controlled design, leading
to a multilevel recovery procedure. A special case of
hardware-controlled recovery is found in statically-redundant systems in
which faults are masked by redundant hardware, and thus remain totally
invisible to the software. Two examples of such systems are the OAO data
processor, which used component redundancy, and the CPU of the SATURN V
guidance computer, which used TMR protection [COOP 76], [ANDE 67].
Probably the earliest use of TMR (triplication and voting) is found in
the SAPO computer, designed by A. Svoboda in 1950-53 [OBLO 62]. SAPO
also possesses several other fault-tolerance features, including
duplication, parity checking, and retry. A separate software-controlled
recovery system is needed in statically-redundant systems if they are to
continue operating after the first fault escapes the masking effect and
affects the software.

Dynamically redundant systems with hardware control usually 
depend on a dedicated hardware module that gathers fault signals and 
initiates recovery. Different uses of duplexing and hardware-controlled 
switchover techniques are found in the memory, power supply, and 
peripheral units of the SATURN V guidance computer in combination with a 
TMR-protected serial CPU unit [ANDE 67]. Separate fault-detection and 
switchover-control units were used for every functional unit. Probably 
the first operational computer with fully hardware-controlled dynamic 
redundancy was the experimental JPL-STAR computer [AVIZ 71a]. Intended 
for self-contained multiyear space missions, this computer employs a 
special Test-And-Repair-Processor (TARP) module to control recovery and 
self-repair. Software assistance is invoked only to perform memory 
copying and to resume normal operation after self-repair. The French 
MECRA computer is another early experimental design [MAIS 71]. A few 
other hardware-controlled system designs that have not reached operation 
have been described in recent literature [AVIZ 75a], [CONN 72]. An 
interesting recent experiment is the C.vmp multiprocessor, which can 
operate in a fault-tolerant mode as a TMR configuration of DEC LSI-11 
computers [SIEW 77]. 

The principal advantage of hardware-controlled recovery 
systems lies in their independence of the operation of any software 
immediately after the fault has occurred. The recovery process is 
transferred to software only after its ability to operate has been 
assured. The relatively late appearance of such systems may be 
attributed to the need to introduce the recovery module into the design 
at its inception, thereby requiring an early commitment to the 
hardware-controlled approach. 

2.3.2 Software-Controlled Recovery Systems 

The software-controlled recovery systems depend on special 
programs to initiate the recovery action upon the detection of a fault. 
Fault signals are obtained by both hardware and software methods; for 
example, parity checkers, comparators, power-level monitors, watchdog 
timers, test programs, reasonableness checks, etc. The main limitation 
of these systems is the need for the recovery software to remain 
operational in the presence of faults, since recovery cannot otherwise be 
initiated. A significant advantage of the software-controlled approach 
is that existing "off-the-shelf" hardware system modules may be used to 
assemble fault-tolerant organizations. These modules contain various 
forms of hardware fault detection, which usually are supplemented by 
further software methods. For this reason software-controlled systems 
appeared earlier and are currently being used in numerous applications 
requiring high reliability and availability. While every modern 
operating system incorporates some recovery features, this report is 
limited to selected illustrations of historically important and advanced 
systems. 

An important early design of the 1950's that had complete 
duplication and extensive recovery provisions was the SAGE system 
[EVER 57]. The IBM System/360 architecture contains very complete 
serviceability provisions for multi-system operation in order to attain 
high availability, reconfiguration, and failsoft operation [CART 64]. 
An early example of a multi-system which includes further extensions of 
the System/360 design is the IBM 9020 multiprocessing system for air 
traffic control applications [IBM 67]. Noteworthy are the operational 
error analysis program and the diagnostic monitor of the 9020. An 
interesting illustration of extensive use of backup storage and dynamic 
reconfiguration in a general-purpose time-shared system is found in the 
MIT Multics System [CORB 72]. The Pluribus is a minicomputer/ 
multiprocessor system (with extensive fault-tolerance provisions), 
which serves as a switching node in the ARPA Network [KATS 78]. The 
TANDEM system is a recently announced commercial multiprocessor system 
with software-controlled fault-tolerance [TAND 76]. 

Another direction of software-controlled system development 
is found in aerospace applications. Representative illustrations of 
this approach are the SIFT design [WENS 78], the C.S. Draper Laboratory 
Symmetric Multiprocessor [HOPK 78], and the COPRA system [MERA 76], all 
of which are in design and development stages. An already operational 
four-computer fault-tolerant complex is the U.S. Space Shuttle computer 
system [COOP 76], [SKLA 76]. 

One other area of application which requires fault-tolerant 
operation and very high availability for several years of continuous 
operation is the control of electronic telephone switching systems. 
These systems usually employ manual repair by replacement of a failed 
part as the last (off-line) step of the recovery procedure, while 
maintaining normal operation by means of the remaining system modules. A 
well-documented illustration is found in the Electronic Switching 
Systems (ESS) of Bell Telephone Laboratories. The ESS designs use several 
hardware techniques (duplication, matching, error codes, and functional 
monitors) and special software (check routines, diagnostics, audits), 
as well as software and hardware emergency procedures when normal 
recovery action does not succeed [TOYW 78], [BEUS 69]. The Plessey 
System 250 is a fault-tolerant multiprocessor system for switching system 
control [HAME 72]. 

2.3.3 Fault-Tolerant Subsystems 

Besides the complete systems discussed above, many efforts 
have been carried out to provide fault-tolerance for functional 
subsystems, which then can be assembled to form a fault-tolerant system. This 
is especially true for secondary and mass storage which has been charac- 
terized by relatively low reliability in the past. Representative error 
coding applications include the use of codes for error control in data 
communications, magnetic tape units, disc files, primary random access 
storage, and a photo-digital mass store [TANG 69]. Single-error correct- 
ing codes are used in the control storage of the No. 1 ESS [DOWN 64], 
the main and control storage of IBM System/370 computers, and several 
other semiconductor memory systems. Error correcting codes have proven 
to be a very effective method for fault-tolerance in the storage medium, 
and the remaining problems exist in protection of the memory access and 
readout circuitry. These have been investigated in an experimental 
design [CART 76]. 

Recent studies have considered the problem of fault-tolerance 
in associative memories and processors [PARH 74]. In general, processor 
fault-tolerance has been provided by duplication and reconfiguration at 
the system level. Investigations have been conducted on the use of 
arithmetic error codes to detect errors caused by processor faults 
[AVIZ 71b] and an experimental processor has been designed and 
constructed for the JPL-STAR computer [AVIZ 71a]. Continuing reductions in 
the cost of processor hardware make further emphasis on duplication or 
triplication [HOPK 78] very likely, although error-detecting codes 
remain a convenient method for the identification of the faulty 
processor in a disagreeing pair. An exception is found in large scientific 
computers with multiple arithmetic processors, in which replication is 
not practical, and graceful degradation procedures must be employed 
[AVIZ 77a]. A potentially very effective approach to error detection in 
integrated circuits of processors is self-checking logic design 
[CART 74], [WAKE 74]. 

2.4 MODELING AND ANALYSIS 

The choice of fault-tolerance functions and redundancy tech- 
niques needs to be supported by an assessment whether the system 
possesses the expected fault-tolerance. Insufficiencies of the design 
may be uncovered, and the design can be refined by changes or additions 
of various forms of redundancy. There are two approaches to the evalua- 
tion of fault-tolerance: 

(1) The analytic approach, in which fault-tolerance 
measures of the system are obtained from a mathematical 
model of the system. 

(2) The experimental approach, in which faults are 
inserted either into a simulated model of a system, or 
into a prototype of the actual hardware, and fault- 
tolerance measures are estimated from statistical 
data. 

The principal quantitative measures of the effectiveness of 
fault-tolerance are reliability (with respect to permanent faults) and 
survivability (with respect to transient faults) [AVIZ 75a]. Methods 
for the prediction of these measures are discussed in this section. 




2.4.1 Analytic Modeling: Permanent Faults 

A quantitative reliability prediction for a system requires 
the knowledge of numerical failure rates of the components, which are 
given in failures/hour and are usually assumed to be constant. If 
technologies which are under development are to be used in a new design, 
the failure rates need to be extrapolated or predicted analytically. 
Different and possibly time-dependent failure rates may apply to some 
classes of failures, such as those causing distributed faults. The 
reliability R(t) is a function of the failure rates and is defined as 
the probability of the survival of the functional capabilities of a set 
of hardware elements up to the time t, given that all hardware was in a 
perfect condition at the time t = 0. For a non-redundant system and 
constant failure rates, the reliability is $e^{-\lambda t}$, where 
$\lambda$ is the sum of the failure rates of all components (system A of 
Figure 2-1). In this case, all components have to survive up to the 
time t. Fault-tolerance of the system is attained only if correct 
program execution is maintained by the surviving hardware; for this 
reason the survivability with respect to transient faults must also be 
considered in a complete evaluation. 
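The non-redundant case can be illustrated with a short Python sketch. It is only an illustration of the formula above: the function name and the three component failure rates are hypothetical values chosen for the example, not data from this study.

```python
import math

def series_reliability(failure_rates, t):
    """Reliability of a non-redundant system at time t (hours).

    Every component must survive, so the system failure rate is the
    sum of the component failure rates and R(t) = exp(-lambda * t).
    """
    total_rate = sum(failure_rates)
    return math.exp(-total_rate * t)

# Hypothetical component failure rates in failures/hour (assumed values).
example_rates = [2e-6, 5e-6, 3e-6]
r_1000_hours = series_reliability(example_rates, 1000.0)
```

With these assumed rates the summed failure rate is 1e-5 failures/hour, so the call evaluates exp(-0.01) for a 1000-hour interval.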

A very common quantitative measure used to compare two or 
more different designs has been the MTTF (mean time to failure), defined 
as 

    $\mathrm{MTTF} = \int_0^\infty R(t)\,dt$. 

Given the non-redundant system reliability $R(t) = e^{-\lambda t}$, we 
have $\mathrm{MTTF} = 1/\lambda$, and the comparison of several MTTF's 
directly compares the failure rates ($\lambda$) of the competing systems. 
When redundancy is introduced, the reliability function R(t) becomes a 
polynomial in $e^{-\lambda t}$ (e.g., system B in Figure 2-1) and the 
R(t) curves of systems being compared may have crossover points. In this 
case, the area under the curve does not indicate which system is better 
for a given time interval, and the MTTF may become a misleading measure. 
Two more precise measures of comparison are illustrated in Figure 2-1 
and are discussed below. 
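The following Python sketch illustrates why the MTTF can mislead. It assumes, purely for illustration, that the redundant system is a TMR arrangement with a perfect voter, whose standard reliability polynomial is 3R^2 - 2R^3 with R = exp(-lambda t); the text does not state which system Figure 2-1 plots. The function names and the trapezoidal integration are choices made for this example.

```python
import math

def r_simplex(t, lam):
    """Non-redundant system: R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

def r_tmr(t, lam):
    """Assumed example of a redundant system: TMR with a perfect voter."""
    r = math.exp(-lam * t)
    return 3 * r**2 - 2 * r**3

def mttf(reliability, lam, horizon_factor=40.0, steps=100_000):
    """Trapezoidal estimate of the integral of R(t) over [0, horizon_factor/lam].

    The tail beyond the horizon is neglected; for the functions above it
    is of the order exp(-horizon_factor).
    """
    upper = horizon_factor / lam
    h = upper / steps
    total = 0.5 * (reliability(0.0, lam) + reliability(upper, lam))
    for k in range(1, steps):
        total += reliability(k * h, lam)
    return total * h

lam = 1e-4  # hypothetical failure rate, failures/hour
mttf_simplex = mttf(r_simplex, lam)
mttf_tmr = mttf(r_tmr, lam)
```

Under these assumptions the TMR system has the smaller MTTF (5/(6 lambda) against 1/lambda) although it is the more reliable system for every t below the crossover point at ln 2 / lambda, which is the situation described above.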

Given a fixed "mission time" T, for which the highest 
reliability is desired, the comparison of two systems requires only the 
values of $R_A(T)$ and $R_B(T)$ in order to select the best system. The 
Reliability Improvement Factor is defined as 

    $\mathrm{RIF} = (1 - R_A) / (1 - R_B)$ 

at the specified mission time T, and it serves as a measure of improvement 
attained by using the "B" system [ANDE 67]. When a fixed mission time 
is not specified, the Mission Time Improvement Factor (MTIF) serves as a 
convenient comparison measure [BOUR 69]. It is defined as 

    $\mathrm{MTIF} = T_B / T_A$ at $R_A(T_A) = R_B(T_B)$ = a specified 
    reliability (e.g., .99 or .90), 

where $T_A$ and $T_B$ are the times at which the system reliabilities 
$R_A(t)$ and $R_B(t)$, respectively, fall to the specified value. 
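Both comparison measures can be evaluated numerically, as in the Python sketch below. System A is taken to be non-redundant and system B is assumed, for illustration only, to be TMR with a perfect voter (reliability 3R^2 - 2R^3); the bisection routine and all function names are hypothetical and are not part of the source.

```python
import math

def r_a(t, lam):
    """System A: non-redundant, R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

def r_b(t, lam):
    """System B (assumed for illustration): TMR with a perfect voter."""
    r = math.exp(-lam * t)
    return 3 * r**2 - 2 * r**3

def rif(t, lam):
    """Reliability Improvement Factor at mission time t > 0."""
    return (1.0 - r_a(t, lam)) / (1.0 - r_b(t, lam))

def time_to_reliability(reliability, lam, target):
    """Bisection for the time at which a decreasing R(t) falls to target."""
    lo, hi = 0.0, 1.0 / lam
    while reliability(hi, lam) > target:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reliability(mid, lam) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mtif(lam, target):
    """Mission Time Improvement Factor T_B / T_A at the specified reliability."""
    t_b = time_to_reliability(r_b, lam, target)
    t_a = time_to_reliability(r_a, lam, target)
    return t_b / t_a
```

For this assumed pair of systems, `mtif(lam, 0.99)` is greater than 1 while `mtif(lam, 0.3)` is less than 1, which reflects the crossover of the two reliability curves.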

We observe that reliability modeling remains useful even if 
specific numerical failure rates and mission times are not given, since 
it still permits the relative comparison of many competing designs. The 
failure rates are normalized with respect to a reference measure of 
complexity and the MTIF is used as the criterion of quality. The most 
fundamental difference in computer reliability modeling is that between 
static and dynamic models for the reliability of systems which 
incorporate protective redundancy. Both classes of models are considered in 
the following discussion. 

The class of static reliability models is suitable for the 
reliability prediction of systems with static hardware redundancy. The 
redundant elements are assumed to be permanently connected and to fail 
statistically independently. They have the same failure rate and are 
instantaneously available to perform the masking of a failure with unity 
probability of success. Under these assumptions, the reliability of a 
redundant system is obtained as the sum of the reliabilities of all 
distinct configurations that do not lead to system failure. Reliability 
models of static redundancy are found in handbooks and textbooks of 
reliability theory and are used for reliability analysis of various 
redundant structures, e.g., relay contact networks, aircraft frames, 
etc. [BARL 63]. The principal limitation of the static model in 
computer reliability modeling is the assumption that the fault-masking action 
is always successful as long as redundancy is not exhausted. This 
assumption cannot be justified in systems which employ various forms 
and combinations of dynamic hardware, software, and time redundancy, 
and dynamic reliability models have to be created for these systems. 
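As an illustration of the static model, the Python sketch below sums the probabilities of all configurations in which at least K of N identical, independently failing modules survive. The function name and the K-out-of-N framing are chosen for this example; perfect masking is assumed, exactly as the static model requires.

```python
from math import comb, exp

def k_out_of_n_reliability(n, k, lam, t):
    """Static-model reliability of a K-out-of-N structure at time t.

    Each module has reliability r = exp(-lam * t). The system survives
    in every configuration with at least k working modules, so R is the
    sum of the probabilities of those configurations.
    """
    r = exp(-lam * t)
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))
```

For n = 3 and k = 2 the sum reduces to the familiar TMR expression 3r^2 - 2r^3, and for n = k = 1 it reduces to the non-redundant reliability r.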

The use of dynamic redundancy requires the success of con- 
secutive fault detection and recovery actions in order to utilize 
redundant (spare) parts. The use of static reliability models for the 
dynamic case is equivalent to assuming unity probability of success of 
both actions. For this reason, very high reliabilities are predicted 
as the number of spares is increased. Early in the studies of dynamic 
redundancy it was recognized that imperfect detection and recovery may 
cause system failure before all spares had been used. The effect of 
such imperfections was formalized in the dynamic reliability model 
through the concept of "coverage," defined as the conditional probability 
of successful recovery, given that a fault has occurred [BOUR 69]. 
This model has served as the reference point for subsequent 
investigations of closed systems, i.e., of those systems in which off-line repair 
of failed parts is not available, and the system is certain to fail after 
all redundancy resources have been exhausted. 

Recent research has resulted in a general dynamic reliability 
model which employs Markov modeling techniques and subsumes nearly all 
models for both static and dynamic redundancy that have been developed 
to date [NGYW 76], [NGYW 77a]. Its principal advantage is that a single 
efficient computing procedure serves to perform the reliability 
prediction for any one of a variety of closed systems, including those in which 
degradation is provided. Extensions to repairable systems and to 
transient faults also have been made in this model. 

A closed fault-tolerant computer system is treated as a set 
of homogeneous closed subsystems, each of which consists of a set of 
identical modules that are either in active or spare status. "Active" 
means "participating in the computing process," i.e., a powered spare is 
not active, although its failure rate is the same as that of the active 
modules. Since every subsystem must survive in order for the system to 
survive, the system reliability is the product of the reliabilities of all 
subsystems. The modeling effort therefore deals with a closed 
homogeneous subsystem. The set of modules forming such a subsystem is 
characterized by the following parameters: 

N = Initial number of modules in the active configuration 

D = Number of degradations allowed in the active configuration 

S = Number of spare modules 

Ca = Coverage for recovery from active module failures 

Cd = Coverage for recovery from spare module failures 

$\lambda$ = Failure rate of one active module 

$\mu$ = Failure rate of one spare module 
($\mu = \lambda$ if spare is powered) 

Y = Sequence of allowed degradations of the active 
configuration 

CY = Coverage vector for degraded configurations 

The parameter Y is an integer vector of the form Y = (Y[1], ..., Y[D]), 
where Y[1], ..., Y[D] are the numbers of active modules remaining in 
successive degraded active configurations. The coverage vector has the 
form CY = (CY[1], ..., CY[D]), in which CY[i] is the coverage associated 
with the transition to the degraded configuration described by Y[i]. 
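The parameter set can be held in a small record, as in the following Python sketch. The class and field names are hypothetical, and the consistency checks (D entries in each vector, strictly decreasing numbers of remaining active modules) are an interpretation of the definitions above rather than rules stated in the source. The helper for the system reliability applies the product rule for independent subsystems given earlier in this section.

```python
from dataclasses import dataclass
from math import prod
from typing import Sequence, Tuple

@dataclass(frozen=True)
class SubsystemParams:
    n: int                      # N: initial number of active modules
    d: int                      # D: number of degradations allowed
    s: int                      # S: number of spare modules
    ca: float                   # coverage for active module failures
    cd: float                   # coverage for spare module failures
    lam: float                  # failure rate of one active module
    mu: float                   # failure rate of one spare module
    y: Tuple[int, ...] = ()     # active modules left after each degradation
    cy: Tuple[float, ...] = ()  # coverage of each degradation transition

    def __post_init__(self):
        if len(self.y) != self.d or len(self.cy) != self.d:
            raise ValueError("Y and CY must each have D entries")
        previous = self.n
        for remaining in self.y:
            # Assumed check: each degradation leaves fewer active modules.
            if not 0 < remaining < previous:
                raise ValueError("Y must decrease strictly below N")
            previous = remaining

def system_reliability(subsystem_reliabilities: Sequence[float]) -> float:
    """Product of subsystem reliabilities (every subsystem must survive)."""
    return prod(subsystem_reliabilities)

# TMR as one subsystem: three active modules, one degradation to two.
# The coverage value CY[1] = 1.0 is an assumption made for this example.
tmr = SubsystemParams(n=3, d=1, s=0, ca=1.0, cd=1.0,
                      lam=1e-4, mu=1e-4, y=(2,), cy=(1.0,))
```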

At any given time each module is in one of three possible 
states: it is in the failed state; it is a good spare (all spares are 
either powered or unpowered); or it is a member of an active 
configuration which consists of all those modules currently participating in the 
computing process. Once a module has failed and the system recovers 
from the failure (either through static fault-masking or dynamic 
reconfiguration), it is assumed that the failed module is isolated from the 
system and will no longer contribute to system reliability or 
unreliability. This implies that the possibility of compensating failures in 
voting systems and similar secondary effects are not considered in this 
model. 

In a dynamically redundant subsystem, an active configuration 
of N modules is supported by a bank of S spare modules. When the spares 
are exhausted and one more failure of an active module occurs, the sub- 
system is usually considered as failed. However, in some applications 
it continues to operate in a degraded mode, i.e., it has a smaller set of 
active modules (and hence a possible degradation in performance). The 
abandonment of active modules upon failure continues until the active 
configuration falls below a specified minimum number of modules, at 
which time the subsystem fails. The degradation sequence is described 
by the vector Y in the reliability model. Statically redundant 
subsystems and hybrid-redundant subsystems with a static core also have an 
active configuration which degrades to some extent before subsystem 
failure occurs. (For example, a TMR subsystem degrades from 3 to 
2 modules upon the first failure.) Hence they are treated in the 
reliability model in the same manner as the dynamically degrading 
subsystems. 

The condition of a closed subsystem is characterized by a 
model with a finite number of states, each representing a distinct 
subsystem configuration which is either good or is failed. For closed 
subsystems, the goal is to obtain the statistics of the time to first 
occurrence of subsystem failure. Hence, all failed states are merged 
into one state denoted by F. A transition out of a good state takes 
place when a failure occurs in one of the modules. Depending on whether 
recovery from this failure is successful, the transition will be to 
another good state or to the failed state. When it is assumed that 
failure rates are constant and that (with respect to the time scale of 
reliability prediction) the recovery from a failure is accomplished 
instantaneously, the model is a Markov model. 
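A deliberately simplified instance of such a Markov model is sketched below in Python. It is not the general model of this report: it assumes no degradation (D = 0), assumes that spare failures are always recovered (Cd = 1), and integrates the state equations with a classical fourth-order Runge-Kutta step instead of using an analytic solution. The function name and argument order are hypothetical.

```python
import math

def closed_subsystem_reliability(n, s, ca, lam, mu, t, steps=2000):
    """R(t) for N active modules backed by S spares, with coverage Ca.

    State j (0 <= j <= S) means "subsystem working with j good spares".
    From state j > 0, an active failure (rate n*lam) is recovered with
    probability ca and leads to state j-1, otherwise to the failed
    state F; a spare failure (rate j*mu) leads to state j-1. From
    state 0, any active failure leads to F.
    """
    def deriv(p):
        dp = [0.0] * (s + 1)
        for j in range(s + 1):
            dp[j] -= (n * lam + j * mu) * p[j]
            if j > 0:
                dp[j - 1] += (n * lam * ca + j * mu) * p[j]
        return dp

    p = [0.0] * s + [1.0]  # all S spares are good at t = 0
    h = t / steps
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv([x + 0.5 * h * k for x, k in zip(p, k1)])
        k3 = deriv([x + 0.5 * h * k for x, k in zip(p, k2)])
        k4 = deriv([x + h * k for x, k in zip(p, k3)])
        p = [x + h / 6.0 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return sum(p)  # probability of not being in the failed state F
```

For one active module with one unpowered spare (mu = 0) the result can be checked against the closed form exp(-lam t)(1 + Ca lam t), which shows directly how a coverage below unity reduces the benefit of the spare.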

The state diagram of Figure 2-2 is the model of the closed 
fault-tolerant subsystem which is defined by the set of parameters (N, 
D, S, Ca, Cd, $\lambda$, $\mu$, Y, CY) explained previously. The 
subsystem is self-repairing and has provisions for degradation of the 
active configuration after the spares have been exhausted. The selection 
of spares occurs in a linear order, and a spare that fails in an 
unrecoverable mode blocks the use of the spares that follow it in the 
selection sequence. Furthermore, it destroys the ability to degrade, 
because the subsystem fails at the time when the unrecoverably failed 
spare unit is switched into service. This effect is incorporated in the 
model by transitions to the states with an overbar such as 
$\overline{(N,S-1)}$, $\overline{(N,0)}$, etc. The subsystems in the 
state (N,i) and in the state $\overline{(N,i)}$ have the same 
configuration, but the subsystem at state $\overline{(N,i)}$ has lost its 
ability to degrade because of the existence of a non-recoverable failure 
in one of the (still unreached) spare modules. 

Almost all fault-tolerant system models that have been 
studied in the past can be represented by this model. Table 2-1 
characterizes several of them in the notation described above. 

The reliability equation of one closed fault-tolerant sub- 
system has the form: 


Figure 2-2. Markov Reliability Model for Closed Systems 



Table 2-1. Characterization of Several Models of Fault-Tolerant Systems 

System                 N     D    S  Ca           Cd           lambda     mu         Y              Reference 

Simplex                1     0    0  1            1            $\lambda$  $\lambda$ 

Static 
TMR                    3     1    0  1            1            $\lambda$  $\lambda$  2              [BOUR 71] 
TMR/Simplex            3     1    0  1            1            $\lambda$  $\lambda$  1              [BOUR 71] 
NMR                    2n+1  n    0  1            1            $\lambda$  $\lambda$  2n, ..., n+1   [MATH 75a] 
NMR/Simplex            2n+1  n    0  1            1            $\lambda$  $\lambda$  2n-1, ..., 1   [MATH 75b] 

Dynamic 
R [indices illegible]  q     0    S  c            c            $\lambda$  $\mu$                     [BOUR 69] 
R*(N,S,[illegible])    N     0    S  [illegible]  [illegible]  $\lambda$  $\mu$                     [RENN 73a] 
K-out-of-N             N     N-K  0  c            c            $\lambda$  $\lambda$                 [WYLE 67] 
R(2,S)                 2     1    S  1            1            $\lambda$  $\mu$      1              [RENN 73b] 

Hybrid 
H(N,S,D)               N     D    S  1            1            $\lambda$  $\mu$      N-1, ..., N-D  [BRIC 73] 
R(N,S)                 2n+1  n    S  1            1            $\lambda$  $\mu$      2n, ..., n+1   [MATH 70] 
TMR/Spares             3     2    S  1            1            $\lambda$  $\mu$      2, 1           [TAYL 73] 






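    [Equation illegible in source: R(t) is written as a product 
    involving a time-dependent vector X(t), a matrix A of coefficients, 
    and a second time-dependent vector; the definitions of the vectors 
    and the layout of A are not recoverable.] 

The coefficients in the matrix A are functions of the parameters 
(N, D, S, Ca, Cd, $\lambda$, $\mu$, Y, CY) and are computed by an 
algorithm given in Table 2-2 [NGYW 77a]. 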


Based on the model described above, the UCLA Automated 
Reliability Interactive Estimation System (ARIES) has been implemented 
in APL as a set of interactive programs for the modeling of 
fault-tolerant computers [NGYW 77b]. Generality and efficiency are achieved 
in ARIES because it is based on the unified solution to the reliability 
modeling problem. To achieve flexibility, the user is provided not only 
with functions for evaluating the reliability measures of interest, but 
also with programs to create, modify and examine representations of the 
systems which are being designed. 


The Markov model for closed systems shows that their 
reliability equations have the standard form: 

    R(t) = [sum over i of terms in $k_i$ and $a_i$; exact form 
    illegible in source] 

where the $a_i$ are simple functions of the modeling parameters and the 
$k_i$ can be efficiently computed. By applying Markov modeling techniques to 
repairable systems, the same standard form for their reliability 
equations is obtained, but now the $a_i$ must be computed as eigenvalues of the 
transition probability matrix of the Markov model and the $k_i$ need a more 
general and less efficient procedure. The reliability analysis of both 


Table 2-2. Algorithm for computing the coefficients of the matrix A 
[NGYW 77a] 

[The formulas of this table are illegible in the source. The 
recoverable outline is: Step (1) iterates over the allowed degradations, 
for I = 1 to D; Steps (2), (2a), (2b), and (2c) use the results of 
Step (1) and iterate over the spares, for M = 1 to S.] 


closed and repairable systems has many common properties that have 
allowed an extension of ARIES to include repairable fault-tolerant 
systems as well. 

The repairable systems modeled by ARIES are the closed 
systems in the Markov reliability model which have been made repairable 
by the presence of one or more repairmen [NGYW 77a]. Hence, they are 
modeled by the same set of parameters (N, D, S, Ca, Cd, $\lambda$, $\mu$, 
Y, CY) as closed systems, plus two more parameters (M, y), where M is the 
number of repairmen and y is the repair rate of each repairman. 

2.4.2 Analytic Modeling: Transient Faults 

The next step to be taken in modeling is to address the 
problem of transient faults [NGYW 76]. These cause system failures by 
damaging the information content of the system during their presence. 
This damage will be permanent and will eventually lead to irrecoverable 
errors in the system unless some means of recovery is provided. Recovery 
in this case consists of a restoration of the information structure so 
that the system can continue to function properly. The hardware remains 
intact and the full capability of the machine is retained, in contrast 
to permanent fault recovery where the system degrades in performance 
unless spares are used to replace faulty modules. 

The methods to effect recovery from suspected transient 
faults usually consist of a number of successively more difficult 
recovery phases. For example, a system may use the sequence of an 
initial delay, instruction retry, program rollback, and system restart 
as a four-phase recovery procedure. The next phase is entered if the 
current phase fails to accomplish a satisfactory recovery. 

The processes which generate transient faults are difficult 
to characterize. The model adopts the viewpoint that the transient 
fault environment can be characterized by two fundamental parameters: 
transient arrival rate and transient duration [AVIZ 73a]. It is 
assumed that transient arrival is a Poisson process with a constant 
arrival rate and that each transient fault has a duration which is 
independently distributed according to an exponential law. These 
assumptions appear to be consistent with the limited number of 
observations on transients available in current literature and have the 
advantage of being more readily mathematically tractable than other possible 
choices. The two parameters modeling the transient fault environment 
under these assumptions are defined as follows: 

$\tau$ = transient fault arrival rate for one module 

D = mean duration of each transient 
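Under the Poisson and exponential assumptions just stated, two simple probabilities follow directly from the arrival rate and the mean duration. The Python sketch below evaluates them; the function names and the numerical values are hypothetical and serve only to illustrate the assumptions.

```python
import math

def prob_transient_outlasts(mean_duration, interval):
    """Probability that an exponentially distributed transient with the
    given mean duration is still present after `interval` time units."""
    return math.exp(-interval / mean_duration)

def prob_new_transient(arrival_rate, interval):
    """Probability that at least one transient arrives (Poisson process)
    in one module during `interval` time units."""
    return 1.0 - math.exp(-arrival_rate * interval)

# Hypothetical values: mean duration 0.01 s, arrival rate 1e-3 per second,
# and a recovery phase lasting 0.05 s.
p_persist = prob_transient_outlasts(0.01, 0.05)
p_recur = prob_new_transient(1e-3, 0.05)
```

The first quantity corresponds to a transient that persists through a recovery phase of the given length, and the second to the recurrence of a fault while that phase is in progress.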

A transient recovery process may fail because of several reasons. The 
model deals with four causes of failure: one is excessive duration, 
which is a function of $\tau$ and D, while the other three are 
characterized by the parameters recoverability r, effectiveness E, and 
interference rate $\rho$. All four are discussed below. 

The first cause is occurrence of persistent transients. They 
are transients that last throughout an entire phase of a recovery action, 
causing that phase of the recovery action to fail. A very long transient 
will lead to unsuccessful outcome of the entire transient recovery 
effort. Then the transient fault will be treated as permanent by the 
system and permanent fault recovery actions will be initiated. The 
probability of a persistent transient depends on the arrival rate $\tau$ 
and mean duration D of transient faults. 

The second cause is a catastrophic fault. Such a fault 
occurs when the transient fault damages insufficiently protected critical 
information. Also, faults that are not detected soon enough after their 
occurrence can lead to so much information damage ("memory mutilation") 
as to make recovery impossible. Furthermore, real-time systems have 
certain tasks which must be accomplished within strict time limits. 
Delay of these tasks by a transient fault also may lead to a system 
crash. The probability of these and other possible catastrophic faults 
is modeled by the recoverability parameter r which is defined as the 
conditional probability: 

r = Prob (fault is not catastrophic | fault occurs). 

Since the effects of both permanent and transient faults are similar in 
most systems and are about as likely to cause catastrophic failures, one 
value of r is used to model both types of catastrophic faults. 




The third cause of unsuccessful recovery for a given recovery 
phase is the ineffectiveness of the recovery technique employed. For 
example, it has been estimated that instruction retry as a transient 
fault recovery technique is effective only in approximately half of all 
cases [CART 64]. The effectiveness of a particular recovery phase is 
modeled by the effectiveness parameter E which is defined as the 
conditional probability: 

E = Prob (recovery action is successful | it is initiated 
against a noncatastrophic transient fault). 

The fourth cause is interference, which occurs when a second 
independent fault (transient or permanent) interrupts the function being 
performed to effect a recovery. How the system behaves in the presence 
of such interference depends on the recovery capability that is built 
into a system. A conservative assumption is that interference, like 
catastrophic failures, will always lead to a system crash. The 
probability of interference depends on the duration of recovery, and on the 
complexity of the recovery elements that must remain fault-free in order 
to carry out the recovery action. The latter is modeled by the 
interference rate, defined as: 

$\rho$ = failure rate of the recovery element hardware. 

This hardware includes both dedicated recovery hardware elements and those 
elements that are used to store, deliver, and execute recovery software. 

Given the preceding parameters, transient fault recovery 
can be modeled as a part of the general model; that is, it is also 
modeled on a subsystem basis, since each subsystem may have different 
recovery requirements and a separate recovery strategy may apply for 
each. The recovery strategy is a multiphase recovery process which 
executes n successive recovery phases, as shown in Figure 2-3. 
Transition to the next phase takes place if the present phase is not effective. 
The recovery process is completed and normal processing resumes if a 
successful recovery is achieved during the present phase. The system 
can crash during the present phase due to interference. (A "crash" is a 
failure of the programs to continue correct execution.) If neither a 
crash nor a recovery occurs in all n phases, then the transient recovery 
process is considered to have been unsuccessful (because the fault 
persists) and permanent fault recovery is initiated. 

Figure 2-3. Transient Fault Recovery Process 

The model employs (for i = 1, ..., n) the following 
conditional probabilities: 

$PE_i$ = Prob (system enters i-th recovery phase | fault occurs) 

$PR_i$ = Prob (system recovers in i-th recovery phase | fault 
occurs) 

$PF_i$ = Prob (system crashes in i-th recovery phase | fault 
occurs). 

The sequence of events in a transient fault recovery process is depicted 
in Figure 2-3, which shows its three outcomes. They are parameterized 
by the following three conditional probabilities, which apply to the 
transient recovery process: 

$C_T$ (Transient Coverage) $= \sum_{i=1}^{n} PR_i$ 
    = Prob (Transient recovery succeeds | fault occurs) 

$L_T$ (Leakage) $= PE_{n+1}$ 
    = Prob (Fault is treated as permanent | fault occurs) 

$F_T$ (Probability of a crash) $= (1 - r) + \sum_{i=1}^{n} PF_i$ 
    = Prob (System crashes during recovery | fault occurs). 
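The three outcome probabilities can be tabulated from the per-phase values, as in the Python sketch below. It assumes that the probability of entering the first phase equals the recoverability r and that each later phase is entered only if the preceding phase neither recovered nor crashed, so that the leakage is what remains of r after all n phases; this bookkeeping, the function name, and the numerical values are assumptions made for the example.

```python
def transient_outcomes(r, pr, pf):
    """Return (coverage, leakage, crash) for an n-phase recovery process.

    r  : recoverability, Prob(fault is not catastrophic | fault occurs)
    pr : list of PR_i, recovery probabilities per phase
    pf : list of PF_i, crash probabilities per phase
    """
    if len(pr) != len(pf):
        raise ValueError("PR and PF must list the same number of phases")
    coverage = sum(pr)
    crash = (1.0 - r) + sum(pf)
    # Assumed bookkeeping: PE_1 = r and PE_(i+1) = PE_i - PR_i - PF_i,
    # so the leakage PE_(n+1) is the remainder after n phases.
    leakage = r - sum(pr) - sum(pf)
    if leakage < -1e-12:
        raise ValueError("phase probabilities exceed the recoverability r")
    return coverage, leakage, crash

# Hypothetical three-phase process; the first phase is a pure delay.
c_t, l_t, f_t = transient_outcomes(0.99, [0.0, 0.5, 0.3], [0.001, 0.002, 0.003])
```

Under this bookkeeping the three outcomes are exhaustive, so the coverage, leakage, and crash probabilities sum to one.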

Because a system usually cannot immediately distinguish whether a 
detected fault is transient or permanent, it is assumed that the 
transient fault recovery is the first process initiated. This assumption 
is reflected in the definition of the above parameters by making them 
conditional on the occurrence of any fault, transient or permanent. 
The parameters $C_T$, $L_T$, and $F_T$ give the relative probabilities of the 
three possible outcomes of the transient recovery process and thus 
determine the reliability of the system in the presence of transient 
faults. To complete the choice of modeling parameters, it is necessary 
to define: 

- Effectiveness of the l-th recovery action 

Tj^ - Duration of the i-th recovery action 

In the general case, $T_i$ is a random variable. In order to
limit the complexity of the model, the assumption is made that it is a
constant, which would be an upper bound. It is also postulated that the
first stage of any recovery strategy is an intentional delay of duration
$T_D$, in order to allow the transient fault to subside. Then $T_1 = T_D$ and
$E_1 = 0$ since there will be no active recovery action during the delay
(recovery phase 1). The transient reliability measures $C_t$, $L_t$, and $F_t$
are computed from the basic parameters of the subsystem by the use of
some simple probability relations, as shown in Table 2-1.




Table 2-3. Derivation of Transient Reliability Measures


$PF_i$ = Prob (phase i is entered) × Prob (interference in phase i)

= [expression illegible in source]

$PR_i$ = Prob (phase i is entered) × Prob (fault is a transient)

× Prob (recovery action is effective) × Prob (no interference)
× Prob (no recurrence of fault in phase i)

× Prob (duration of transient does not extend into phase i)

= [expression illegible in source]

$PE_{i+1} = PE_i - PR_i - PF_i$; [remainder illegible in source]


The factor [symbol illegible] = Prob (fault is a transient | phase i) is a probability conditional on
entry to phase i of the recovery process and decreases as i increases. The reason is
that with increasing knowledge that the fault has not been eliminated by the preceding
recovery mechanisms, there is more likelihood that it is a permanent instead of a
transient. To estimate it, define


[symbol illegible] = Prob (recovery phase i is entered | fault is a permanent)

[symbol illegible] = Prob (recovery phase i is entered | fault is a transient)


We assume that [expression illegible], that is, a catastrophic fault will not cause entry to phase 1,
but will enter the system failure state immediately. The following relations also hold:

[relations illegible in source]

Then

[illegible] × Prob (System does not crash in phase i) = [expression illegible]

[illegible] × Prob (System does not crash, but there is no recovery in phase i)
= [expression illegible]

where λ and τ are respectively the permanent and transient failure rates of one subsystem
module.




The reliability model of Figure 2-2 does not include tran-
sient fault recovery. This limitation is removed by integrating the
transient fault recovery model into the unified model [NGYW 77b]. Fig-
ure 2-4 shows, on a localized basis, the incorporation of the transient
fault recovery model into the reliability model of Figure 2-2. Two
additional states are introduced between each pair of successive oper-
ational states of the subsystem to represent the existence of the tran-
sient and permanent fault recovery processes. In addition to the original
set of parameters, transitions between states are also governed by the
three transient fault recovery parameters: $C_t$, $L_t$, and $F_t$. It is
assumed that transients have no effect on the status of spare modules;
hence the transitions between states that are caused by spare module
failures remain the same. Although the system spends a finite amount
of time in these two recovery states, for all practical purposes it can
be assumed that the recovery process is instantaneous, because even in
the worst case the recovery time is several orders of magnitude smaller
than the average time between faults in the hardware. With this assump-
tion, the two recovery states are merged into the operational states and
Figure 2-4 becomes Figure 2-5. The general model of Figure 2-2 is pre-
served when the transient fault recovery model is incorporated. The
main effect of this incorporation is to change the effective failure
rate of each module from λ to λ* and the effective coverage factor from
Ca to Ca* as given in Figure 2-5. The derivation of λ* and Ca* follows
from Figure 2-4.

Because the general model of Figure 2-2 is preserved, the
same efficient computational procedure can be applied in those cases
when transient fault modeling is desired, with the obvious modification
that λ and Ca must now be replaced by λ* and Ca*. The programming sys-
tem ARIES has been extended to model transient fault recovery. Based
on a characterization of transient fault recovery in a subsystem by
means of the parameters τ, D, p, r, $E_i$, $T_i$ and $T_D$, ARIES estimates the
transient fault recovery parameters $C_t$, $L_t$, and $F_t$, from which an efficient
reliability estimation of a subsystem in mixed transient and permanent
fault environments can be made [NGYW 77b].






2.4.3 Heuristic Approaches: Simulation and Experiments

Simulation and experimentation with a hardware prototype are
two approaches to heuristic prediction of reliability. Although their
use is more costly and time-consuming than that of analytic models, these
methods are essential when the analytic models do not adequately repre-
sent the complex structure of the system or the nature of the expected
faults. Furthermore, the users of systems in various failure-critical
applications often insist on heuristic validation of the initial analytic
results prior to the production and use of a system.

An accurate description of the system and detailed
characterization of faults are the principal prerequisites when simula-
tion is employed to derive the reliability estimates for the computer.
Modern simulation programs include provisions to model both permanent
and transient faults, and to consider the hardware-software interaction
by representing a variety of recovery algorithms [LEVY 75]. An impor-
tant early use of simulation was the reliability prediction of TMR logic
in the SATURN V guidance computer [ANDE 67].

Experimental reliability prediction using a hardware
prototype requires a large investment of effort in constructing the
prototype, but avoids the inaccuracies which may occur in postulating
the fault effects in a simulated model of the system. An example is the
experimental fault-tolerant JPL-STAR computer. In this computer an
electronic "black box" was used to inject faults of adjustable duration
and extent at selected points in the hardware of the system during its
operation [AVIZ 72]. Statistical data on the cases in which recovery
did not succeed was automatically collected and processed. The data was
also used to derive estimates of the coverage parameters for analytic
modeling. Several weaknesses in the fault-tolerance implementation of
the original design were identified and eliminated during the experi-
ments. The stability of recovery algorithms was studied under multiple-
fault and repeated-fault conditions, and the performance of system
software was extensively tested.
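
As an illustration of how fault-injection counts can be turned into a
coverage estimate for analytic modeling, the following Python sketch
computes the observed fraction of successful recoveries together with a
Wilson score interval. The choice of interval, the function name, and the
counts are assumptions made for this example; the source does not state
how the JPL-STAR data was processed.

```python
import math


def coverage_estimate(recovered, injected, z=1.96):
    """Estimate coverage from fault-injection counts.

    Returns (point_estimate, lower, upper), where the bounds form a Wilson
    score interval; z = 1.96 corresponds to roughly 95% confidence.
    """
    if injected <= 0 or not 0 <= recovered <= injected:
        raise ValueError("need 0 <= recovered <= injected and injected > 0")
    p = recovered / injected
    denom = 1.0 + z * z / injected
    center = (p + z * z / (2.0 * injected)) / denom
    half = z * math.sqrt(p * (1.0 - p) / injected
                         + z * z / (4.0 * injected * injected)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


# Hypothetical campaign: 1000 injected faults, 990 successful recoveries.
print(coverage_estimate(990, 1000))
```

The interval makes visible how many injections are needed before a
coverage value close to unity can be claimed with any confidence.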

The current rapid advances in the design of novel and 
complex fault-tolerant systems have overtaken the capabilities of 




analytic modeling. As a consequence, experimental reliability predic-
tion remains a very important area for further development and
application.

2.5 TOLERANCE OF MAN-MADE FAULTS

Man-made faults are all non-physical faults that occur
because of human mistakes, i.e., execution of improper actions or absence
of expected actions during the procedures of specification, design,
detailed implementation (construction or programming), modification,
maintenance, and use of information processing systems. They do not
include physical faults that are consequences of human actions. The
manifestations of such physical faults are the same as those caused by
natural phenomena; for this reason they are treated by the same techni-
ques of fault-tolerance and belong in the same category as all other
physical faults. Man-made faults include the non-physical faults caused
by imperfections in various design, programming, and maintenance tools,
such as compilers, assemblers, design automation programs, maintenance
and operation manuals, testing procedures and devices, etc.

For the purpose of systematic discussion, it is convenient
to partition man-made faults into the classes of design faults and inter-
action faults. Design faults are the faults that are introduced into
the system during various phases of implementation: specification,
design, programming, translation to machine code, detailed logic design
and layout of logic circuits, interconnection of hardware elements, and
later modifications of hardware and software. The causes of design
faults are twofold: incomplete, ambiguous, or erroneous specifications,
and mistakes committed during the various phases of translation of a
specification into the final implementations, i.e., assemblies of inter-
connected hardware elements and arrays of digitally represented symbols.

Interaction faults are faults that are introduced into the 
system via man/machine interfaces during operation or maintenance phases 
by operator action that is not appropriate to the current state of the 
system. They are caused typically either by a misunderstanding of the 
operator's manuals or by typographical errors that occur while informa- 
tion is entered into the system. 





The problem of man-made faults has remained of consistently
great concern to the designers and users of information processing sys-
tems from the specification of the first system to the present. Some
complex and costly systems have never reached operating condition
because design faults could not be eliminated or controlled (tolerated)
within the existing time limits and cost constraints. Many other systems
have experienced severe delays in delivery and major cost overruns. In
a few cases the question of the possible existence of latent design
faults has spilled over from technological and economic considerations
into politics and public controversy. A very prominent illustration of
such an event is the recent controversy in the U.S. regarding the possi-
bility of unreliable behavior of the ABM (anti-ballistic missile)
defense computer system.

In contrast to physical faults, the problems of man-made 
faults have not been suddenly alleviated by a major technological
breakthrough similar to the invention of semiconductor and magnetic-core
components. Advances in the understanding and ability to handle man-made
faults have come at a slow and steady rate, and they have barely kept 
pace with the rapidly growing complexity of systems and the increasing 
demands for near perfectly fault-free system behavior in numerous 
critical applications, in some of which human lives are endangered by 
fault-induced system failures. 

2.5.1 Design Faults 

An overview of the approaches used to handle faults from 
the origins of machine computing to the present shows that a priori 
fault elimination (fault-avoidance) has been the dominant choice for 
the handling of design faults that are introduced during specification, 
design, construction, programming, and modification of both hardware 
and software [ICRS 75], [NELS 75]. An all-out effort to eliminate
design faults takes place before the system is first put into regular 
service or returned to use after a modification. 

The approaches taken to assure design fault elimination have 
originated both in theoretical studies and in problem-solving approaches 
developed from direct experience. The main theoretical developments in 




this area are proof-of-correctness techniques [LOND 75] and mathematical
models for software reliability prediction [SHOO 73], [MORA 75]. The
practice-originated "software engineering" techniques include procedures
for the collection and analysis of fault data; management procedures for
software development; tools and techniques for software design, such as
specification languages and the structured programming approach; software
verification and validation techniques [NELS 75]; and digital-logic simu-
lation techniques for hardware design verification [SZYG 76], [BUTL 7d].

Despite all of the above techniques for fault elimination,
left-over design faults have been observed in most systems during oper-
ation. For this reason most systems have been provided with emergency
procedures to detect error states that may be due to design faults, to
record them, and to bring the system to a state in which external
assistance may be brought in to complete the analysis of the condition
and to reinitiate operation. While these emergency procedures are not
unlike some fault-tolerance techniques for physical faults, the function
that is accomplished is only the "shutdown" function with respect to
either a part of the system or the entire system.

More complete fault-tolerance of design faults has not yet
been introduced into existing computer systems, and only very recently 
have some research efforts been started to explore this problem in 
depth. Because of the existence of much more extensive research and 
practical experience with the tolerance of physical faults, it is inter- 
esting to look for transferability of concepts and techniques. The 
principal difference between physical and design faults is that physical 
faults in hardware occur after the start of the computing process, while 
design faults in software (and hardware, as well) are present at the
start, but become disruptive only at a later time. However, modifica- 
tions or corrections of discovered design faults occasionally lead to 
new design faults, and therefore the discoveries of software and hard- 
ware design faults may be expected throughout the useful life of any 
large system, similar to the occurrence of physical faults. This 
practically verified observation establishes a relationship between the 
methodologies for dealing with physical faults and design faults: the 

methods of protective redundancy that have proven successful in the 




tolerance of physical faults may be transferable to provide tolerance of
design faults as well. Three aspects of relevance of physical fault-
tolerance can be identified [AVIZ 75b]:

(1) The contribution of physical fault-tolerance techniques
in identifying and isolating design faults.

(2) The common aspects of fault-tolerance that are equally
relevant to physical and design faults.

(3) The transfer of physical fault-tolerance techniques
and experience to software design faults, considering:

(a) the applicability of software, 

(b) the potential advantages of software 
fault-tolerance, 

(c) the cost of its use, compared against the 
traditional fault-avoidance techniques. 

First, the presence of physical fault-tolerance is directly 
useful in handling design faults because it provides the means to identify 
those cases of abnormal system behavior that are due to physical faults.
Furthermore, extensions of physical fault-tolerance techniques may be 
applicable to provide hardware-controlled protection of software and 
the data base against attempts to interfere with its operation and to 
access privileged information. 

Second, an area in which a common ground exists for physical 
and design fault-tolerance efforts is the analytic modeling and quanti- 
tative prediction of system reliability. Recent work on software 
reliability models [SHOO 73], [MORA 75] indicates the possibility of
mutual reinforcement that would lead to the development of analytical 
models for the total system reliability, including both the physical 
fault and design fault aspects. 

Third, the redundancy techniques that have been successful 
in handling physical faults may be transferable to design fault-tolerance. 
Both the static and the dynamic hardware redundancy approaches have their 
counterparts in software fault-tolerance. In the static case (called 
N-version programming), two or more programs are generated independently




and then are operated concurrently on multiple copies of the fault-
tolerant hardware [AVIZ 77b]. Comparison or majority voting at speci-
fied points is employed to detect or correct the effects of design
faults. Systems such as SIFT [mn 76], the Symmetric Multiprocessor
[HOPS 75], and the Space Shuttle Computer System [SKLA 76], are especially
suitable for such N-version programming. The dynamic case uses the
equivalent of standby sparing, in which acceptance tests serve to detect
design faults and to initiate a switchover to an alternate software
module [RAND 75], [HECH 76]. An extension of the above techniques to
hardware design fault-tolerance is also feasible: functionally identical
copies of modules then must be independently designed and manufactured
by separate organizations in order to avoid the occurrence of
identical design faults in all copies.
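
The two software redundancy schemes described above can be sketched in a
few lines of Python. This is a minimal illustration only, not the
mechanism of any of the cited systems: the function names, the integer
square-root "versions", and the deliberately seeded design fault are all
hypothetical, and the voter assumes that results can be compared exactly.

```python
import math
from collections import Counter


def n_version_vote(versions, x):
    """Static case: run every version and return the majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass  # a version that raises casts no vote
    if results:
        value, count = Counter(results).most_common(1)[0]
        if count > len(versions) // 2:
            return value
    raise RuntimeError("no majority among the versions")


def recovery_block(alternates, acceptance_test, x):
    """Dynamic case: try alternates in order until one passes the test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # an exception counts as a failed alternate
        if acceptance_test(x, result):
            return result
    raise RuntimeError("every alternate failed the acceptance test")


# Hypothetical independently written integer square roots.
def isqrt_library(x):
    return math.isqrt(x)


def isqrt_search(x):
    r = 0
    while (r + 1) * (r + 1) <= x:
        r += 1
    return r


def isqrt_faulty(x):
    return x // 2  # deliberately seeded design fault


def isqrt_acceptable(x, r):
    """Acceptance test: r is the integer square root of x."""
    return r * r <= x < (r + 1) * (r + 1)


print(n_version_vote([isqrt_library, isqrt_search, isqrt_faulty], 16))
print(recovery_block([isqrt_faulty, isqrt_search], isqrt_acceptable, 16))
```

In the voting case the faulty version is outvoted; in the recovery-block
case its result is rejected by the acceptance test and the alternate is
used instead. Both depend on the versions not sharing the same design
fault.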

The state of the art in fault-tolerance of design faults 
resembles that of physical fault-tolerance in the early 1960's. The 
cost and the effectiveness of the design fault-tolerance approaches 
remain to be investigated, and the techniques require much further
development and experimentation. The success of fault-tolerance of 
physical faults, however, does indicate very strongly that design fault- 
tolerance cannot be safely ignored solely because of the past tradition 
of fault-avoidance in this field. 

2.5.2 Interaction Faults 

The possibility of introducing man-made faults also exists 
via man/machine interaction during system operation. The control of 
such interaction faults has been implemented primarily by means of
operator training and by providing complete guidelines in operation and
maintenance manuals. This approach corresponds to the fault-avoidance
approach for physical and design faults. The demands on the operator
have been reduced by the development of increasingly more sophisticated
operating systems. However, interaction faults have remained a major
problem area in system operation.

Fault-tolerance approaches to interaction faults have 
remained confined to immediate practical solutions to observed problems. 
The principal goal here is the implementation of the detection function. 




which allows the system to reject apparently incorrect operator inputs.
The main methods are consistency checks, requirements for appropriate
passwords, and coded data entry. In some very critical cases, two or
more operators are employed whose input commands and data must agree
in order to be accepted by the system.
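
The detection function for operator inputs can be illustrated with a
small Python sketch that combines agreement between independent entries
with a consistency check. The command format, the function names, and the
heading range are hypothetical and are chosen only for this example.

```python
def accept_operator_input(entries, is_consistent):
    """Accept a command only if all operators agree and it is consistent.

    entries       -- the same command as entered independently by each operator
    is_consistent -- consistency check applied to the agreed command
    Returns the accepted command, or None if the input is rejected.
    """
    if len(entries) < 2:
        raise ValueError("at least two independent entries are required")
    first = entries[0]
    if any(entry != first for entry in entries[1:]):
        return None  # operators disagree: reject the input
    if not is_consistent(first):
        return None  # agreed, but not appropriate to the system: reject
    return first


def heading_is_valid(command):
    """Hypothetical consistency check: headings lie in [0, 360) degrees."""
    kind, value = command
    return kind == "SET_HEADING" and 0 <= value < 360


print(accept_operator_input([("SET_HEADING", 90), ("SET_HEADING", 90)], heading_is_valid))
print(accept_operator_input([("SET_HEADING", 90), ("SET_HEADING", 80)], heading_is_valid))
```

A rejected input is reported as None here; an actual system would
presumably prompt the operators to re-enter the command.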

2.6 CURRENT PROBLEMS AND PROSPECTS FOR THE FUTURE

2.6.1 Reasons for Fault-Tolerance

At the present time, we can identify several reasons for the
acceptance and general use of full fault-tolerance (without manual
intervention) in information processing systems of the future. The main
reasons are:

(1) The need to minimize the risks associated with computer
failures in systems in which the failures either
endanger human lives, or threaten to cause heavy
economic losses to the users. Examples of the first
class are systems for patient monitoring in hospitals,
for air traffic control, and for guidance and control
of high-speed vehicles. In the second class are
systems to control power generation and distribution,
to control processes in automated factories, to handle
financial transactions, etc.

(2) The need for reliable computing in environments that
do not allow access for manual maintenance, such as
space and undersea locations, and other locations in
which access is either impossible or excessively
costly.

(3) The need for almost uninterrupted operation of real-
time systems in which manual intervention creates
unacceptable delays.

(4) The possibility of lower initial cost (for a given
reliability goal) than a system that depends on fault-
avoidance. This may occur in those cases in which





fault-tolerance allows the use of less costly compo-
nents, or reduces the cost of design-fault elimination
prior to system delivery.

(5) The possibility of a lower life-cycle cost than a
system with manual maintenance requirements. Fault-
tolerance can reduce maintenance to a scheduled off-
line replacement of disconnected elements (or an
exchange by mail), and eliminate the costs associated
with the unavailability of a system between failure
and completion of repair.

(6) The psychological support to system users provided by
the knowledge that fault-tolerance is incorporated into
the system on which they depend for their safety or
economic benefit.




2.6.2 A Design Methodology 

Research results and design experience lead us to suggest 
that the introduction of fault-tolerance can be accomplished by following
a systematic procedure: 

(1) Performance requirements are established and system 
architecture is specified with the initial assumption 
that faults will not occur. 

(2) Classes of faults that are to be tolerated in the 
design are identified, and the extent of tolerance is
specified for each class of faults. 

(3) Cost-effective methods of protective redundancy (time, 
hardware, software) are chosen to cover every class of 
faults identified above, and system architecture is 
modified to incorporate the redundancy. 

(4) Analytic or experimental reliability prediction tech-
niques are employed to evaluate the fault-tolerance
that is provided by redundancy.

(5) Checkout methods are devised to test all redundancy
features. Where applicable, fault-tolerance is extended
to effect automatic maintenance of peripheral systems
that are connected to or controlled by the computer.

Design experience has shown that several iterations of (3) 
and (4) may be necessary to arrive at a satisfactory fault-tolerant sys- 
tem architecture. 
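
The iteration between steps (3) and (4) can be illustrated by a Python
sketch that adds spares until a predicted reliability goal is met. The
prediction used here is a simplified textbook standby-sparing model, not
the ARIES model of this report: one active module with constant failure
rate, spares that do not fail while dormant, and a fixed coverage applied
to every recovery. The function names and the numerical values are
assumptions made for the example.

```python
import math


def standby_reliability(failure_rate, mission_time, spares, coverage):
    """Reliability of one active module backed by dormant spares.

    The system survives k module failures (k <= spares) only if each of
    the k recoveries succeeds, which happens with probability `coverage`.
    """
    m = failure_rate * mission_time
    terms = ((coverage * m) ** k / math.factorial(k) for k in range(spares + 1))
    return math.exp(-m) * sum(terms)


def spares_needed(failure_rate, mission_time, coverage, goal, max_spares=20):
    """Step (3)/(4) loop: add spares until the prediction meets the goal."""
    for spares in range(max_spares + 1):
        if standby_reliability(failure_rate, mission_time, spares, coverage) >= goal:
            return spares
    return None  # the goal is not reachable by adding spares alone


# Illustrative numbers: 1e-4 failures per hour over one year of operation.
print(spares_needed(1e-4, 8760, coverage=1.0, goal=0.99))
print(spares_needed(1e-4, 8760, coverage=0.98, goal=0.99))
```

With imperfect coverage the predicted reliability in this simplified
model is bounded regardless of the number of spares, so the loop can
return None; in that case the redundancy method itself, rather than the
number of spares, has to be revisited.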

2.6.3 Current Roadblocks 

In view of the potential benefits of full fault-tolerance,
it is inevitable to ask: "Why is there so relatively little fault-
tolerance in the computer systems of the present generation?" The
obstacles to the appearance of full fault-tolerance are rather diverse.
Some of the more obvious problem areas are identified below.

(1) Lack of Continuity. Some fault-tolerance techniques
developed for first-generation computers (for physical
faults) were discarded in the second generation because
of much higher reliability of semiconductor and
magnetic-core components. Later, many ad hoc solutions
were not openly documented because of their trade
secret status, leading to the re-invention of good
solutions as well as the repetition of many mistakes
of the past.

(2) Lack of Cost/Benefit Measures. Thus far, there are no
general methods for a convenient quantitative assess- 
ment of the benefits (in terms of life-cycle cost 
reduction) of fault-tolerance. The initial extra cost 
which is due to the various redundancy techniques is 
much more directly evident and tends to bias a large 
class of users (who do not have an absolute requirement) 
in favor of systems without fault-tolerance. 

(3) Lack of Specifications and Acceptance Tests. The user
community at large still does not have a sufficient 




knowledge of the properties and limitations of fault-
tolerance. As a consequence, specifications of reli-
ability are insufficiently precise and virtually
unverifiable in advance of system use. For example,
most reliability requirements for a given time interval
do not specify the classes of faults and do not state
what constitutes acceptable recovery. For another
example, MTBF specifications do not explicitly deal
with fault classes (e.g., transients, design faults)
and recovery requirements, and also ignore the
differences between redundant and nonredundant designs.
Extremely high reliability and MTBF predictions are
sometimes offered without stating the implicit assump-
tions of a static reliability model and a very limited
class of faults. For contrast, consider speed require-
ments in instructions/second, which can be stated and
tested for acceptance quite precisely.

(4) Fragmentation of Efforts. Efforts to increase relia-
bility of computing originate within several disciplines
of theory and practical computer engineering. These
include computer system architecture, software engineer-
ing, testing and design verification, design of data
base management systems, computer networks and communi-
cation systems, component and packaging engineering,
field operation and maintenance, and others. Although
they all have a common end goal, the efforts have
remained largely disjoint. A definite lack of a common
viewpoint and of systematic communication is evident
at the present time. There is also a real gap between
the results of theoretical investigations and practical
engineering solutions to fault-tolerance problems.

(5) Inertia in the Design Process. Introduction of fault-
tolerance requires an early commitment and a signifi-
cant departure from traditional evolutionary design of
computer product lines, in which compatibility of
software is usually a dominant factor. While the
number of fault-tolerance techniques to serve as
maintenance aids has been increasing, none of the
major manufacturers has yet announced a fully fault-
tolerant line of computers. The only fault-tolerant
systems that were actually delivered were custom-made
products for special requirements.

(6) Resistance to Potential Impact. Successful introduction
of fault-tolerance may cause some de-emphasis of several
currently flourishing activities. Examples are the
production of ultra-reliable components, the business
of providing manual maintenance and the activities
associated with the a priori verification of software.
It is not unexpected to encounter skepticism about
fault-tolerance from the advocates and suppliers of
those techniques.

In conclusion, we note that while most of the above-enumerated 
difficulties are common to many disciplines of computer engineering and 
computer science, they reach probably their greatest severity in the 
studies and implementation of fault-tolerance.

2.6.4 Goals and Prospects 

The preceding list of problem areas also serves as a guide 
for the selection of goals for research, development and implementation 
of systems. Major goals in fault-tolerance for the immediate future are:

(1) The development and acceptance among designers, 
analysts, and users of information processing systems 
of an integrated viewpoint of fault-tolerance as an 
attainable and necessary attribute of a good system. 

(2) The development of precise quantitative methods for 
the specification, acceptance testing, and cost/benefit
analysis of fault-tolerant systems. 


(3) The design, construction, and testing of experimental
fault-tolerant systems. Such systems are absolutely
essential, since they serve as vehicles for the vali- 
dation of new ideas, for the development and refinement 
of performance specifications and acceptance tests, 
and for the education of potential users, proving that 
such systems can be practically delivered. 

(4) Continuing investigations of the new frontiers in
fault-tolerance techniques, especially the tolerance 
of design faults in software and hardware, modeling 
and analysis of complete systems, advanced degradation 
techniques for large systems, and fault-tolerance for 
interaction faults. Another stimulating new idea is 
the possible use of artificial intelligence techniques 
to implement fault-tolerance [GOLD 75]. 

The preceding discussion has shown that fault-tolerant 
computing is still a young, largely unexplored and undeveloped discipline. 
The accelerating progress in both theory and implementation indicates 
that the ability to tolerate a large class of physical, design, and 
interaction faults will be taken for granted in the computer systems of 
the 1990's, just as the ability to execute a large class of programs is
taken for granted in the computer systems of today. 


SECTION 3

OBJECTIVES AND ARCHITECTURE SELECTION


The purpose of this section is to describe the assumptions
and tradeoffs which led to the selected building block-SCCM architecture.
Key objectives of the study are:

(1) to examine and evaluate architectural techniques by
which fault-tolerance can be incorporated in next-
generation computer systems;

(2) to determine requirements for VLSI circuitry which
will be required; and,

(3) to investigate the feasibility of incorporating fault-
tolerance as an integral part of future USN building-
block computer programs.

The complexity of modern military systems has led to a significant
problem of maintenance. Equipment failures lead to a reduction
in operational readiness, and maintenance support is a major element in
the life-cycle costs of a number of weapons systems. This study is 
directed toward the routine use of automated redundancy techniques to 
greatly reduce and simplify system maintenance requirements. 

The starting point to achieve this goal is the core elec- 
tronics portion of complex systems. A technology of fault-tolerant 
computing has been developed which provides correct computer operation 
in the presence of internal faults by the use of redundancy and auto- 
mated repair. Using these techniques, computers can be developed at 
relatively low cost which provide long-term reliability and which can 
be utilized to automate system diagnosis and repair by: 

(1) diagnosing faults and specifying modular replacement 
in external subsystems, or 

(2) performing automated system repairs to achieve 
maintenance-free missions. 


The scope of this work unit is limited to the digital
computing system and those fault-tolerance techniques which can be
utilized in the context of a computer building-block development pro-
gram using next generation VLSI technology.

Although the theoretical groundwork for fault-tolerant
computing has been rather well developed, the use of such machines has
been limited to a very small number of special applications. The Apollo
guidance computer, OAO spacecraft, and ESS telephone switching systems
are the primary examples which are most often quoted. These are all
custom machines for a specific application.

This study is directed at the question: "What is required
to enable the routine use of fault-tolerant computing in a wide range
of applications?" First, the requirement for fault-free computing must
exist, e.g., the system designer must express a need for correct answers 
and no unscheduled downtime. But in order to levy this requirement, the 
designer must be assured of two things: 

(1) that the cost of a fault-tolerant design is lower than 
the cost of an occasional computer failure. 

(2) that the risk is acceptable, i.e., that the fault- 
tolerant computer will be delivered in time and work 
as specified. 

In order to achieve the twin goals of low cost and risk it 
is best to avoid custom designed computers, and concentrate on machines 
which are already in wide usage. Not only is extensive software avail- 
able, but existing chip sets such as the TI 9900, LSI 11, and the 8086 
have been characterized and tested through widespread use. 

Thus, we have concentrated on the use of existing machines 
in fault-tolerant configurations. In order to satisfy the project user 
with regard to risk, the resulting architecture should be
straightforward and operate in a fashion that can be readily understood. It should
be compatible, as much as possible, with existing standardized components,
interconnections, and busing formats. And, indeed, the fault-tolerant 




architecture should be capable of a wide range of applications so that
it can be included in a future standards program. Risk, and often cost,
is lowest when a project can use components and architectures which have
previous operational experience.

In order to achieve acceptable costs, the surrounding cir-
cuits (used to combine processors and memories into a fault-tolerant
configuration) must be reduced to a small number of standard elements
and implemented in VLSI packages. At the current state of the art, a
microcomputer may require 50 LSI chips, while the surrounding cir-
cuitry for fault-tolerance and interconnects may require several hundred
MSI circuits. In order to make fault-tolerance attractive to the user,
those surrounding circuits must be packaged as a few standard VLSI
components.

The primary objective of this study is to develop and verify
a small set of building block VLSI circuits which can be used to combine
existing processors and memories into fault-tolerant computer
configurations.

3.1 REQUIREMENTS FOR FAULT-TOLERANT BUILDING-BLOCK COMPUTERS (FTBBC)

Fault-tolerance requirements are derived from a set of 
assumptions on the applications in which the FTBBC will be used. These 
assumptions on applications and the resulting requirements are listed 
below:

(1) The fault-tolerant computer(s) will be used in a wide
range of applications and, in some cases, will perform
vital functions (such as system-level redundancy
management).

(a) Thus, over a user-prescribed maintenance interval
the reliability should be quite high: 99% or
greater.

(b) Wide variations in the maintenance interval
should be readily accommodated by adding or
deleting redundant elements.



(2) The system containing the computer(s) will have an
operational life of a number of years.

(a) The fault-detection and recovery mechanism of
the FTBBC must be thorough and nearly perfect to
attain reliability over a long period of time.
This is independent of how short a maintenance
interval is chosen or how many spares are
employed. Reliability modeling studies have
shown that "coverage" (the probability of a
correct recovery, given that a fault occurs)
must approach unity to achieve long life without
computational errors or down time.

(3) It is assumed that for most systems, regularly
scheduled maintenance is possible. The computer will 
"fix itself" by replacing faulty modules with spares; 
and the discarded faulty modules will be replaced by a 
repairman at the scheduled maintenance time. In this 
mode of operation, the scheduled maintenance is best 
described as preventive maintenance since the computer 
is still running. It is important, however, that the 
scheduled maintenance costs be minimized. Therefore: 

(a) Redundancy should be applied in an efficient 
fashion to minimize the number of parts which 
can fail, and to reduce initial procurement 
costs. 

(b) The fault-tolerant computer(s) should be capable 
of diagnosing its own faults to a level which 
facilitates off-line repair. 

(4) For applications where human repair is not possible, 
the maintenance interval will be specified to be the 
total operational life of the computer(s) and an 
appropriate number of spare elements shall be employed
to achieve the desired reliability. 


(5) The functions to be performed by the computer(s) will
be vital to the proper operation of its host system.

(a) The computer(s) should not generate erroneous
outputs between occurrence and correction of a
fault. This implies concurrent fault detection
in all parts of the computer(s).

(6) Systems have a wide range of requirements on the
allowable time-outage while the computer(s) is recovering
from a fault.

(a) Capability must be provided to allow for a
recovery time in milliseconds, which is assumed
to be a worst-case requirement.

In short, the FTBBC architecture must have concurrent fault 
detection to attain high coverage and a rapid recovery time. The struc- 
ture must also be modularized to allow an arbitrary number of spare 
elements and simplify replacement procedures. 
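
The sensitivity to coverage noted in requirement (2)(a) can be
illustrated with a simple standby model. The sketch below is not taken
from the reliability models used in this study; it assumes one active
module with a constant failure rate, spares that do not fail while
dormant, and a fixed probability (the coverage) that each failure is
followed by a correct recovery. The function and parameter names are
illustrative only.

```python
import math


def standby_reliability(failure_rate, hours, spares, coverage):
    """Probability that one active module backed by dormant spares is
    still operating after `hours` (illustrative model, see assumptions).

    The number of failures of the active position is Poisson distributed;
    the system survives k failures only if k <= spares and every one of
    the k recoveries succeeded (probability coverage**k).
    """
    expected_failures = failure_rate * hours
    total = 0.0
    for k in range(spares + 1):
        poisson_k = (math.exp(-expected_failures)
                     * expected_failures ** k / math.factorial(k))
        total += (coverage ** k) * poisson_k
    return total


if __name__ == "__main__":
    # Hypothetical numbers: 1e-4 failures/hour, five years, three spares.
    for c in (0.90, 0.99, 1.00):
        r = standby_reliability(1e-4, 5 * 8760, spares=3, coverage=c)
        print(f"coverage={c:.2f}  reliability={r:.4f}")
```

Under these assumptions the reliability can never exceed
exp(-(1 - c) * failure_rate * hours) for coverage c, no matter how many
spares are added, which is why coverage rather than the spare count
dominates for long missions.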

3.2 DISTRIBUTED COMPUTERS 

A distributed computer architecture was selected as the
baseline approach for building block implementation because we feel that
it will have the widest range of applications. (Also, a single computer
architecture is a degenerate case and is thus covered.) Since most
complex systems consist of a set of subsystems, and since the availability
of microcomputers is making it possible to place low cost computing where
it is needed within these subsystems, we believe that there will be an
ever-increasing demand for distributed computing in military applications.
It has been shown in previous work (CART 77) that self-checking computers
are feasible and relatively inexpensive. A distributed network of such
computers can be hardware-efficient in that (1) other computers are
available to aid in the repair of a faulty machine, and (2) redundancy
can be provided in a selective fashion. It is felt that the high degree
of modularity inherent in distributed systems best meets the varying
requirements of performance and reliability, and offers the potential
for simplified fault-tolerance approaches which can be understood and
thus accepted by a potential user.

A superficial view of a distributed system consists of a
number of interchangeable computers connected to I/O devices through a
redundant, shared busing system, as shown in Figure 3-1.



Figure 3-1. A Non-Dedicated Distributed Computer Architecture 

To provide fault-tolerance, the computers may be designed
with internal checking logic to detect their internal faults, or pairs
of computers may run the same computations and compare outputs, or the
machines may be run in triplets with output voting. A common set of
backup spares is used to replace failed computers. These approaches
have the advantage of nondedicated redundancy, in that any spare can be
used to back up any of the active computers and a small number of spares
can be used to back up a large number of active computers.
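
The difference between pair comparison and triplet voting can be
sketched in a few lines. This is an illustration only, with hypothetical
function names: a compared pair can detect a disagreement but cannot tell
which machine is wrong, while a voted triplet can mask one bad output and
identify the dissenting machine for replacement.

```python
from collections import Counter


def duplex_compare(a, b):
    """Compare the outputs of a pair. Returns (value, fault_detected).

    A disagreement is detected but cannot be masked, so no value is
    delivered when the two outputs differ.
    """
    if a == b:
        return a, False
    return None, True


def tmr_vote(outputs):
    """Majority-vote three outputs.

    Returns (voted_value, index_of_dissenter), where the index is None
    when all three agree. Raises RuntimeError if no two outputs agree.
    """
    if len(outputs) != 3:
        raise ValueError("tmr_vote expects exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None
    if count == 2:
        dissenter = next(i for i, o in enumerate(outputs) if o != value)
        return value, dissenter
    raise RuntimeError("no majority among the three outputs")
```

For example, tmr_vote([5, 5, 7]) delivers 5 and flags machine 2, whereas
duplex_compare(5, 7) can only report that a fault exists.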

A closer look at the problem indicates that the majority
of computers in such a network will be dedicated to specific subsystems.
An examination of the bus interface and control logic in various
subsystems shows that, for many, it is cost effective to replace the
internal control logic with a microcomputer, either to save chips or
to establish standardization in subsystem logic designs. More importantly,
by establishing "intelligent" sensors and actuators through the
use of local computers, system level complexity can be greatly reduced.
This is seen in several ways:

(1) The subsystem-system interface can be greatly simplified,
allowing the subsystem contractor to thoroughly
test his device before system integration.

(2) Subsystem-peculiar computing (software) can be developed
by the subsystem contractor.

(3) The computing load on central computers can be drastically
reduced, since they are no longer required to
generate detailed timing signals used in the associated
equipment. These signals are instead generated in the local
computer.

(4) Bus timing and loading are greatly simplified for
reasons mentioned above.

Thus, the structure of distributed control systems falls
rather naturally into a hierarchic structure: a large set of intelligent
sensors and actuators containing their own dedicated computers, and a
smaller set of non-dedicated, high-level computers which coordinate the
lower level processors.

3.3 THE DISTRIBUTED COMPUTER MODEL

The model used in this report for a distributed processing
architecture is shown in Figure 3-2.

Redundant elements and checking circuits are not shown in
order to focus on the basic computational functions which are performed
in a fault-free environment.


Figure 3-2. The Distributed Processing Architecture

The microcomputer modules, which utilize the same microprocessor
and local executive, fall into two types: (1) Terminal Modules,
which are configured with I/O circuits to interface with electromechanical
subsystems in which they are embedded, and (2) High-Level Modules,
which are configured to coordinate the processing in various computers
by control of an intercommunications bus.

Terminal Modules (TM) are located within the various subsystems
and are responsible for local control and data collection. The
Terminal Module contains a microprocessor, memory, a set of I/O modules,
and a passive interface (Bus Adaptor) to each of several intercommunication
buses. Each Bus Adaptor contains a complete DMA controller which
allows the bus system to enter or extract data from the Terminal Module's
memory by cycle stealing techniques. Communication is through message
slots in the local memory.



A High-Level Module enters commands, data, and timing information
into prearranged memory areas within the Terminal Module. The
Terminal Module delivers information to the system by placing outgoing
messages in predetermined locations of its memory, which are then
extracted by a High-Level Module over the bus.

The TM memory can be accessed by several buses simultaneously
because the bus adaptors provide conflict resolution. The TM computer
is normally not notified that data is being entered or taken from its
memory. Periodic process synchronization is provided by a common
Real-Time Interrupt which triggers a local executive to check the TM
memory for incoming commands and data at pre-arranged times.
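
The message-slot convention can be pictured with a small software
model. This is a sketch only: the slot addresses, slot length, and the
"first word nonzero means a new command" convention are assumptions made
for illustration and are not specified in this report.

```python
class TerminalModule:
    """Toy model of a Terminal Module's memory and local executive."""

    COMMAND_SLOT = 0x0100   # hypothetical prearranged incoming area
    OUTPUT_SLOT = 0x0200    # hypothetical prearranged outgoing area
    SLOT_WORDS = 4          # hypothetical slot length in words

    def __init__(self, size=0x1000):
        self.memory = [0] * size
        self.handled = []   # commands the local executive has picked up

    # Bus Adaptor side: cycle-stealing DMA. The processor is not notified.
    def dma_write(self, address, words):
        self.memory[address:address + len(words)] = list(words)

    def dma_read(self, address, count):
        return self.memory[address:address + count]

    # Processor side: run by the local executive on each Real-Time Interrupt.
    def on_real_time_interrupt(self):
        start = self.COMMAND_SLOT
        command = self.memory[start:start + self.SLOT_WORDS]
        if command[0] != 0:                     # assumed "new command" flag
            self.handled.append(command)
            self.memory[start] = 0              # mark the slot as consumed
            self.memory[self.OUTPUT_SLOT] = command[0]  # outgoing message
```

In this model a High-Level Module would call dma_write to deposit a
command and later dma_read to collect the reply; nothing happens in the
Terminal Module until the next Real-Time Interrupt.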

High-Level Modules (HLM) are responsible for coordinating the
processing which is carried out in the remote Terminal Modules or in
High-Level Modules which are lower in the network hierarchy. Each
High-Level Module consists of a microprocessor, memory, Bus Adaptors, and a
Bus Controller.

The Bus Controller, which is unique to High-Level Modules,
can move blocks of data between memories of all modules connected to its
bus. Using the Bus Controller, the High-Level Module can place commands
into the memories of the various computers on its bus and monitor ongoing
processes by reading out selected information.

When activated, the Bus Controller reads a control table
within the memory of the HLM which specifies the transfer, issues the
corresponding commands over the bus to the relevant terminals in the
source and acceptor modules, and then monitors bus activity as the
selected modules exchange information.

Using the Bus Controller, the HLM can move a block of data
from within any internal memory area of a specified source module to a
specified set of contiguous locations within one or more acceptor
modules.


3.3.1 The Intercommunication Bus Structure

Each active High-Level Module has a dedicated bus under its
control which provides a bandwidth of approximately one megabit per
second. In order to provide redundancy, the HLM can relinquish its bus
under one of two conditions: (1) it is not powered, or (2) its processor
specifically releases the bus for a specified time interval. Thus, spare
modules can gain access to a bus whose processor has failed, or a bus can
be multiplexed if several buses have failed.

Access to each bus by the various High-Level Modules is based
on a fixed-priority assignment using a daisy chain structure, as shown
in Figure 3-2, to establish this priority. Modules of higher priority
signal release of the bus via the daisy chain, which then activates that
hardware necessary to allow bus control by modules of lower priority.
The individual buses are physically independent and, therefore, no central
controller exists as a potential catastrophic failure mechanism.
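
The daisy-chain rule for a single bus can be summarized in a short
sketch. The data layout (a list of dictionaries with "name", "powered",
and "released" fields) is an assumption made for illustration; in the
actual design this decision is made by hardware in each module.

```python
def bus_owner(chain):
    """Return the name of the module that controls one bus, or None.

    `chain` lists the High-Level Modules on this bus in daisy-chain
    order, highest priority first. A module passes control down the
    chain only if it is unpowered or has explicitly released the bus.
    """
    for module in chain:
        passes_control = (not module["powered"]) or module["released"]
        if not passes_control:
            return module["name"]
    return None


if __name__ == "__main__":
    chain = [
        {"name": "HLM-A", "powered": False, "released": False},  # failed, off
        {"name": "HLM-B", "powered": True, "released": False},   # spare
    ]
    print(bus_owner(chain))  # the spare takes over the bus
```

Because each bus has its own chain, a fault in one chain affects only
that bus.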

The Bus Controller and Bus Adaptors are highly autonomous
units which contain considerable internal microprogram sequencing to
carry out their functions. For example, the Bus Controller is activated
by an out-of-range store instruction in the HLM; the data "stored" is
the address of a bus control table. The Controller reads out the table
by DMA and controls a data transfer over its bus without further attention
from the HLM processor. Completion is signalled by an interrupt
with a status word stored in the HLM memory.

A bus control table in the HLM contains the identification
and internal memory address of a source module, and the identification
and internal addresses of one or more acceptor modules, followed by a
word count. Internal addresses can be specified directly or by naming
indirect pointers contained within the various source and acceptor
modules. This allows accessing data by name.

The Bus Controller reads the control table and sends the
source and acceptor specifications over the bus as 1553A transmit or
receive commands. The source module then outputs sequential words from
memory and the acceptor module ingests this data.


The Bus Adaptors contain sufficient microprogram control to
recognize transmit (source) and receive (acceptor) commands directed
toward their host computer. These modules then determine the base
address of data to be transferred: either a number received over the
bus for direct addresses, or a number read from a mapping table in local
memory for indirect addressing. The adaptors then steal cycles from the
processor to transfer information into or out of its memory.
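
A bus control table and the resulting block transfer can be modeled
as follows. This is a sketch under stated assumptions: the class and
field names are hypothetical, module memories are plain Python lists, an
indirect address is modeled as a single pointer word in the module's own
memory, and the bus command encoding itself is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Endpoint:
    module_id: int
    address: int
    indirect: bool = False   # True: `address` names a pointer word


@dataclass
class BusControlTable:
    source: Endpoint
    acceptors: List[Endpoint]
    word_count: int


def resolve_base(memory, endpoint):
    """Base address as the Bus Adaptor would determine it."""
    if endpoint.indirect:
        return memory[endpoint.address]   # read pointer from local memory
    return endpoint.address


def run_transfer(table, memories):
    """Copy `word_count` words from the source to every acceptor.

    `memories` maps a module id to that module's memory (a list of words).
    Returns the number of words moved per acceptor.
    """
    source_memory = memories[table.source.module_id]
    base = resolve_base(source_memory, table.source)
    words = source_memory[base:base + table.word_count]
    for acceptor in table.acceptors:
        memory = memories[acceptor.module_id]
        destination = resolve_base(memory, acceptor)
        memory[destination:destination + len(words)] = words
    return len(words)
```

Indirect addressing lets the High-Level Module refer to a data area by
the location of its pointer, so the receiving module can move the area
without the control tables being rewritten.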

A non-fault-tolerant version of this architecture has been
developed under NASA sponsorship, and a six computer breadboard has been
constructed and used to verify its software and communications concepts.
The breadboard was used to simulate several command, telemetry, and
subsystem control functions of a planetary spacecraft. Further information
can be obtained in the following references: RENN 76, LESH 76, and
RENN 78b.


3.4 FAULT-TOLERANCE OPTIONS 

In the distributed network, there are three distinct areas in
which fault tolerance must be applied: the dedicated Terminal Modules,
the nondedicated High-Level Modules, and the interconnecting bus system.

3.4.1 The Terminal Modules 

Since the Terminal Modules are attached by a number of wires
to a specific subsystem, they must have dedicated spares which are also
embedded in the same subsystem. Thus, when redundancy is employed,
dedicated cross-strapped redundant modules are used. This requires
special short-isolated I/O circuits so that (1) a short will not disable
spare modules, and (2) a faulty terminal module can be disabled and a
spare module activated by simply turning off power to one and turning on
power to the other.

The amount of redundancy of Terminal Modules is determined
by the criticality and failure rate of an associated subsystem. For
a block-redundant subsystem (i.e., two identical subsystems, primary and
spare) redundant TMs may not be employed in each individual subsystem.
But for a subsystem which manages a redundant set of sensors and actuators,
the TM should be backed up by one or more redundant spares.


Fault Detection in a TM can consist of

(1) self-checking hardware built into the computer which
detects faults concurrently with normal operation,

(2) or software diagnostics for subsystems which are non-critical
and can tolerate a period of erroneous computations.

The second option above is only viable if the interconnecting
bus system prevents errors generated in a faulty Terminal Module
from propagating through the system and affecting other modules.

Fault Recovery can be handled locally within the terminal
module configuration of a subsystem or can be handled by the High-Level
Modules. If fault recovery is implemented locally, TMs perform "cross
checks" through their I/O logic to allow local fault detection and
reconfiguration [RENN 80b]. This is often unnecessary, since the
High-Level Modules provide an available intelligence which can be used for
this purpose [RENN 80a]. Specifically, the Terminal Module hardware (or
software, through a failed diagnostic) provides a fault indication which
can be sampled over the bus by the High-Level Modules. The appropriate
High-Level Module then commands reconfiguration to a backup spare via
the bus. This recovery process contains a delay of a few milliseconds
but is acceptable for many applications.

3.4.2 The High-Level Modules 

The High-Level Modules have two salient reliability characteristics.
First, they cannot be allowed to make errors, since they
perform high-level control functions and can, by use of a bus, propagate
damaged data throughout the network. Second, they are nondedicated and
can be backed up with a common pool of spares. Two approaches were
investigated for employing redundancy in High-Level Modules: voted
functions and standby redundancy.

The voted functions approach consists of creating a mechanism
to configure groups of three High-Level Modules to perform each separate
computer function. Each triplet is voted and when a fault occurs, the
remaining two modules of an affected triplet command its replacement with
a spare from the common pool (HOPK 75). The advantage of this approach
is that ongoing computations are not interrupted by a failure since the
two remaining computers can continue with the ongoing computation until
a convenient time to reconfigure. It has the disadvantage that it is
expensive and complex. Three computers are required for each computation,
the triad reconfiguration mechanism is complex, and bus bandwidth
is tripled by redundant message transmissions [RENN 80b].

The standby redundancy approach uses computers which are
self-checking. An HLM contains an error code protected memory, compared
duplex processors, and fault-detecting bus circuitry. With a high
degree of confidence, the HLM will detect its own faults when they occur.
Redundant circuits are employed to disable the HLM's ability to control
an intercommunication bus when a fault is detected. If the function
being performed is time-critical, a backup (self-checked) module runs
concurrently with the active HLM. If the primary HLM disables itself,
the "hot" backup HLM springs into action, taking up the ongoing computation.
For non-critical, high-level functions that can be cold-started
after being lost for a second or so, no "hot" backup spare is provided.
A critical function module effects its reconfiguration by activating a
spare, loading it from mass storage, initializing its parameters and
then restarting the non-critical process.

The standby approach is more efficient than the voted func- 
tions approach in the use of hardware, especially if some of the high- 
level functions do not require "hot" backup spares. The disadvantages 
are (1) lower "coverage" than voted approaches, and (2) time delays in 
recovery. 

3.4.3 The Intercommunication Bus System Requirements

The intercommunication bus system should be redundant and 
provide restricted access so that faults are not allowed to propagate 
indiscriminately. Equally important, the structure and functions of the 
bus system directly influence the complexity and verifiability of soft- 
ware. Bus attributes and options that we have chosen for fault-tolerance 
are discussed below. 


(1) Redundant buses are required with no common failure
mechanism in their assignment logic so that only one
bus will fail due to any single fault. This can be
achieved with a separate mechanism for each bus which
assigns buses to high-level modules on the basis of
fixed hardware priority. When a high-level module is
disabled, its bus priority should be relinquished.

(2) High-Level Modules should be capable of initiating bus
transmissions, but Terminal Modules should be passive
and not have this capability. This allows structured
control and prevents a failed Terminal Module from
directly upsetting the whole system. (It is expected
that in many systems, some Terminal Modules will not
be self-checking.)

(3) Each high-level module should have control of only one 
bus for any ongoing system configuration. Centralized 
control is easier to verify and eliminates the indeter- 
minate timing inherent in a multiply controlled bus. 

(4) The bus structure should minimize the software complexity
required for its control, and it should be used in
a way that a minimum of transmissions are time-critical.

(5) The bus system should provide automatic verification
of proper message transmission so that the High-Level
Modules can detect faults and utilize alternate redundant
buses in case of failure.

3.4.4 Architecture Selection 

In order to be able to implement all of the various redundancy
options (described above) we concluded that self-checking computers
should be employed throughout the FTBBC architecture. Recent publications
have shown that self-checking computers are feasible and can be
built relatively inexpensively in VLSI logic [CART 77]. Using
hardware-implemented fault detection, the self-checking computer can
detect internal faults concurrent with normal operation. This property
is essential to implement standby redundancy, which is expected to be
used in the majority of computers in many distributed systems. It also
augments the effectiveness of voting configurations which may be employed
in smaller, more critical portions of complex systems.

The Self-Checking Computer Module and its communications
interfaces are described below. This basic computer module was chosen
to best meet the fault-tolerance requirements of a wide variety of
potential applications.

3.5 BUILDING-BLOCK DEFINITION

The basic component of this fault-tolerant distributed computer
architecture is a Self-Checking Computer Module (SCCM). The SCCM
can be assembled from microprocessors and memory chips, connected by a
small number of standard building block circuits described in the
remainder of this chapter. Each building block is small enough to be
implemented as a single VLSI chip; together they provide the memory, I/O,
and intercommunications functions necessary to interface the SCCM within a
redundant network. The SCCMs are then used as larger building blocks
in a network, in which redundant SCCMs are included to achieve
fault-tolerance.

3.5.1 The Self-Checking Computer Module (SCCM) 

The SCCM contains commercially available microprocessors and
memories, connected by four types of building blocks, as shown in
Figure 3-3. The building blocks are (1) an error detecting (and correcting)
Memory Interface Building Block (MIBB), (2) a programmable Bus
Interface Building Block (BIBB), (3) a Core Building Block (Core-BB),
and (4) an I/O Building Block (IO-BB). A typical SCCM consists of
2 microprocessors, 24 RAMs, 1 MIBB, 3 BIBBs, 2 IO-BBs, and a single
Core-BB. A High-Level Module is an SCCM containing an additional BIBB
microprogrammed to be a Bus Controller, while a Terminal Module is an
SCCM with all of its BIBBs programmed as Bus Adaptors (terminals).

Figure 3-3. The Self-Checking High-Level Module

The building block circuits control and interface the various
processor, intercommunication, memory, and I/O functions to the SCCM's
internal bus. Each building block is responsible for detecting faults
in its associated circuitry and then signaling the fault condition to
the Core-BB by means of an internal fault indicator. The MIBB implements
fault detection and correction in the memory, as well as providing
detection of faults in its own internal circuitry. Similarly, the BIBB
and IO-BB provide intercommunications and I/O functions, along with
detecting faults within themselves and their associated communications
circuitry. The Core-BB checks the processing function by running two
CPUs in synchronism and comparing their outputs. It is also responsible
for fault collection and fault handling within the SCCM.

The Core-BB receives fault indicators from the other
building-block circuits and also checks internal bus information for
proper coding. Upon detecting an error, the Core-BB disables the
external bus interface and I/O functions, isolating the SCCM from its
surrounding environment. The Core-BB can either: (1) halt further
processing until external intervention, or (2) attempt a rollback or
restart of the processor. Repeated errors result in the disabling of
the faulty SCCM by its Core-BB. Recovery can be effected by an external
SCCM which is programmed to recognize the lack of activity from a faulty
SCCM.
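
The fault-handling policy just described can be pictured as a small
state machine. The sketch below is illustrative only: the state names,
the restart limit, and the method names are assumptions, and the actual
Core-BB implements this behavior in self-checking hardware.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    RECOVERING = "recovering"   # outputs inhibited, restart in progress
    HALTED = "halted"           # outputs inhibited, awaiting external help


class CoreFaultHandler:
    """Toy model of Core-BB fault handling (hypothetical interface)."""

    def __init__(self, allow_restart=True, max_restarts=2):
        self.allow_restart = allow_restart
        self.max_restarts = max_restarts   # assumed limit on repeated errors
        self.restarts = 0
        self.state = State.RUNNING
        self.outputs_enabled = True

    def fault(self):
        """Any fault indicator: isolate the module first, then decide."""
        self.outputs_enabled = False
        if self.allow_restart and self.restarts < self.max_restarts:
            self.restarts += 1
            self.state = State.RECOVERING
        else:
            self.state = State.HALTED

    def restart_completed_cleanly(self):
        """Processors re-enable the module after a fault-free restart."""
        if self.state is State.RECOVERING:
            self.state = State.RUNNING
            self.outputs_enabled = True
```

A module that reaches the HALTED state stays silent, which is the lack
of activity that an external SCCM is expected to recognize.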

An important attribute of the building blocks is that they
are interconnected via the internal processor-memory bus. They are all
designed to perform specified functions in response to read or write
commands to reserved addresses appearing on the internal bus. The
majority of addresses are used for conventional access to RAM; however,
the upper 4096 addresses are reserved for I/O functions, external bus
transmission requests, the readout of error-status information, and
reconfiguration commands to the building blocks. For a fetch request to
a specific reserved address, the building-block circuit which recognizes
the address performs the specified function and delivers a word of
information to the internal data bus. Store requests to reserved addresses
deliver information over the internal data bus to the selected building
block. This is the commonly used technique of "memory-mapped I/O" and
it has two major advantages in the building-block SCCM design. First,
this approach avoids processor-specific I/O operations and thus allows
the use of a number of different off-the-shelf microprocessors in the
SCCM. Second, this approach allows access to the building blocks by
both software in the SCCM and from other SCCMs via the external bus
system. Using the external bus, an external SCCM can command DMA READ
and WRITE operations into and out of the memory of the local SCCM. By
directing DMA READ and WRITE cycles to reserved addresses, the external
SCCM also has access to the building blocks in the local SCCM. The
external SCCM can load and read out memory via the bus, and can also
sample error status information, command internal reconfiguration, and
can even remotely control I/O in a faulty local SCCM.
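
The address decoding implied by this scheme can be sketched as
follows. Only the size of the reserved region (the upper 4096 addresses)
comes from the text; the 16-bit address space and the way the reserved
region is divided among the building blocks are assumptions made for
illustration.

```python
ADDRESS_SPACE = 1 << 16                          # assumption: 16-bit addresses
RESERVED_WORDS = 4096                            # from the text
RESERVED_BASE = ADDRESS_SPACE - RESERVED_WORDS   # 0xF000 under the assumption

# Hypothetical partition of the reserved region among the building blocks.
BLOCK_RANGES = {
    "MIBB": range(0xF000, 0xF400),
    "BIBB": range(0xF400, 0xF800),
    "IO-BB": range(0xF800, 0xFC00),
    "Core-BB": range(0xFC00, 0x10000),
}


def decode_address(address):
    """Return which element responds to an internal-bus address."""
    if not 0 <= address < ADDRESS_SPACE:
        raise ValueError("address outside the assumed 16-bit space")
    if address < RESERVED_BASE:
        return "RAM"
    for name, block_range in BLOCK_RANGES.items():
        if address in block_range:
            return name
    raise AssertionError("reserved region is not fully partitioned")
```

The same decoding applies whether the address was issued by the local
processors or arrived as a DMA cycle from an external SCCM, which is what
gives an external module access to the building blocks.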

The following is a brief description of the building-block
circuits.

3.5.2 The Memory Interface Building Block (MIBB)

The MIBB interfaces a set of RAM chips to the internal bus
of the SCCM to form a Memory Module. An SCCM can contain one or more
Memory Modules. A Memory Module consists of:

(1) A 24-bit memory with each bit separately packaged so
that circuit failures will damage only one bit in any
word. Sixteen bits are utilized for storage of computer
data, and six bits are employed for a SEC/DED
Hamming code. The remaining two bits are used as
spares to replace any of the other bits in case
one fails.

(2) A Memory Interface Building Block which connects the 
redundant memory elements to the internal bus. The 
MIBB provides control, Hamming encoding and correction, 
spare bit replacement, parity encoding and checking 
for the local bus, internal checking, and error 
message generation. 

The MIBB is connected to the SCCM internal bus and receives
address, data, and two control signals: a Read/Write level and Memory
Start. Upon receiving a start command, the MIBB checks a parity coded
incoming address from the bus, and for a write operation also checks
incoming data for proper coding. If no error is detected, a read or
write operation is initiated and a completion signal is generated. If
a single bit error is detected upon reading, it is corrected using the
Hamming code.
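
The behavior of a SEC/DED code on a 16-bit word with six check bits
can be demonstrated in software. The sketch below uses a conventional
extended Hamming arrangement (five positional check bits plus one overall
parity bit); the bit layout actually used by the MIBB is not given here,
so the positions and function names are assumptions.

```python
PARITY_POSITIONS = (1, 2, 4, 8, 16)
CODE_LEN = 21   # positions 1..21 hold 16 data bits and 5 check bits
DATA_POSITIONS = [p for p in range(1, CODE_LEN + 1)
                  if p not in PARITY_POSITIONS]


def hamming_encode(data):
    """Encode a 16-bit word as 22 bits; index 0 is the overall parity."""
    bits = [0] * (CODE_LEN + 1)
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (data >> i) & 1
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, CODE_LEN + 1):
            if pos & p and pos != p:
                parity ^= bits[pos]
        bits[p] = parity
    bits[0] = sum(bits[1:]) & 1      # make the total number of ones even
    return bits


def hamming_decode(bits):
    """Return (data, status): status is 'ok', 'corrected', or
    'uncorrectable' (data is None in the last case)."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, CODE_LEN + 1):
        if bits[pos]:
            syndrome ^= pos
    overall = sum(bits) & 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome <= CODE_LEN:
        bits[syndrome] ^= 1          # syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"  # double (or worse) error detected
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        data |= bits[pos] << i
    return data, status
```

In terms of the MIBB signals, a "corrected" result corresponds to the
code-correction indicator and an "uncorrectable" result to the internal
fault indicator.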

Two fault-detection signals are generated: an internal fault
indicator and a code-correction indicator. Each is sent on duplex lines
so that a single fault cannot disable an indicator and go undetected.

The code-correction indicator is sent to the processors as 
an interrupt, and indicates that a single memory-bit error is being 
corrected using the Hamming Code. The processor can inspect the damaged 
location and, if necessary, command that the faulty bit be replaced at 
a convenient time. 

The internal fault indicator signals all faults which cannot 
be corrected within the memory system. This signal is activated when: 

(1) a fault is detected within the MIBB itself

(2) improperly coded information is received over the 
internal bus 

(3) a data error occurs within the memory elements that 
cannot be corrected using the Hamming code. 

This signal is sent to the Core building block. If the
error was caused by a transient fault, correct computation can sometimes
be resumed with a rollback or reset/restart sequence, initiated
from the Core-BB.

The MIBB can receive several commands to read out status,
test faulty locations, and perform internal reconfiguration. These
commands are implemented as out-of-range memory addresses and can thus
be issued by the processor or through the bus system. Specifically,
certain out-of-range read or store instructions are recognized as commands
to the building block and data is absorbed or disgorged for write
and read operations, respectively.

MIBB commands are listed below: 

(1) READ STATUS - Internal fault latches are read out to
the internal bus.

(2) READ ERROR POSITION - The bit position of the most
recent error is read out.

(3) READ ADDRESS OF LAST ERROR - The address where the
last error occurred is read out to the internal bus
(along with an indication if more than one bit has
been corrected).

(4) RESET - Disables spare-bit replacement, returns to
original 16 bits of data.

(5) DISABLE CORRECTION - Disables Hamming correction so
that the memory can be externally diagnosed through
the bus system under control of a different computer
module. Correction is re-enabled by a reset command.

(6) READ REDUNDANT BITS - Used in conjunction with disable
correction, reads out the Hamming protection bits and
spare bit from the last address accessed in the
memory.

(7) REPLACE Ith BIT - Causes spare-bit to replace the
specified bit position. (Two commands are provided -
one for each spare bit plane.)

Several optional Memory Module configurations can be supported
by the MIBB. The user can select the number of memory words
included in the Module (8K, 16K, 32K). It is also possible to implement
a Memory Module which does not use Hamming single-error correction.
Using this option, each memory word consists of 16 data bits, 2 parity
bits for error detection (the same code as is used on the internal bus),
and 0 to 2 spare bits. Upon detection of a fault, it is necessary to
diagnose the memory and command reconfiguration using an external SCCM.
This option is provided for applications which require very low power,
weight, and volume. Options are selected using external pins on the
MIBB.


An internal error indication is generated upon receipt of
improperly coded data or upon read-out of improperly coded information
in the memory. The same error detecting code is employed for the
internal bus and the memory plane.

3.5.3 The Core Building Block (Core-BB) 

The Core Building Block provides three functions: (1) internal
bus arbitration, (2) processor comparison with parity code generation
and checking, and (3) fault-handling. This building block uses
self-checking design, such that a fault in the Core element will result in
disabling the Bus Controller and removing the module from the system.

3.5.3.1 Bus Arbitration. A Bus Arbiter in the Core-BB accepts internal
Bus Request signals from the Bus Adaptors, Bus Controller and, in
the case of terminal modules (to be discussed), from DMA I/O channels.
Upon receiving Bus Requests, the Bus Arbiter signals the CPUs to release
the bus. When the CPUs acknowledge release, the Bus Arbiter returns a
Bus Acknowledge signal to the requesting element on the basis of fixed
priority. Both Bus Request and Bus Release signals are duplicated, with
values 01 and 10 representing valid states. The Bus Arbiter is also
duplicated and is compared with self-checking internal logic to detect
its internal faults.
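
The arbitration rule can be illustrated with a short Python sketch. It is not the Core-BB logic: the requester names, the priority order, and the choice of (1, 0) as "asserted" and (0, 1) as "idle" are assumptions made for illustration. Only the fixed-priority rule and the fact that 01 and 10 are the valid duplicated values come from the text.

```python
# Hypothetical priority order, highest first (the report only says it is fixed).
PRIORITY = ["BC", "BA0", "BA1", "BA2", "DMA0"]


def decode_two_rail(pair):
    """Return True/False for a valid duplicated signal, None for a non-code value.

    Assumption: (1, 0) means asserted and (0, 1) means idle; (0, 0) and
    (1, 1) are invalid and indicate a fault.
    """
    if pair == (1, 0):
        return True
    if pair == (0, 1):
        return False
    return None


def arbitrate(requests):
    """Grant the bus to the highest-priority requester with a valid request.

    requests maps a requester name to its two-rail Bus Request pair.
    Returns (granted name or None, list of requesters with invalid codes).
    """
    idle = (0, 1)
    faulty = [n for n in PRIORITY if decode_two_rail(requests.get(n, idle)) is None]
    for name in PRIORITY:
        if decode_two_rail(requests.get(name, idle)) is True:
            return name, faulty
    return None, faulty
```

For example, `arbitrate({"BA1": (1, 0), "DMA0": (1, 0)})` grants the bus to `BA1`, and a requester presenting (1, 1) is reported as faulty rather than granted.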

3.5.3.2 Processor Comparison, Code Generation and Checking. In order
to detect processor faults, two processors are run in synchronism. Both
receive the same data and execute the same programs in lock step. One
processor is designated primary and the other serves as a check processor.

All outputs of the two processors to the internal bus are
compared by the Core-BB and the 16-bit outputs to the address and data
buses are parity encoded. Incoming data on the internal bus is checked
for proper parity coding.

If the processors disagree, if incoming data is incorrectly 
coded, or if an internal error is detected by self-checking logic within 
the building block circuitry, an error message is sent to the fault- 
handling section of the Core-BB. 


3.5.3.3 Fault-Handling. The fault-handling section of the Core
building block receives internal fault signals from the various building
blocks and from within the other sections of the Core. When a fault is
signalled, the fault handler sends an output inhibit signal to the Bus
Controller and/or IO-BBs and stops the processors. As an optional
feature, the fault-handler can effect a program rollback by causing the
processors to transfer to a restart location. The processors attempt to
re-initialize computations. The processors can command that the module
be re-enabled (release output inhibit) if no additional faults are
detected in the intervening period.

3.5.3.4 Core Building Block Connections and Commands. Core Building
Block connections include:

(1) 32 address and data lines to the check processor.

(2) Control lines to and from each processor (reset/restart,
bus request for DMA, and bus released).

(3) 42 connections to internal bus for address, data and
control.

(4) Clock and Real-Time Interrupt.

(5) Bus Request pairs from up to 5 DMA elements and
corresponding Bus Acknowledge signals (24 lines).

(6) Internal Error inputs from up to 8 other Building
Blocks (12).

(7) Output Inhibit to Bus Controller (2).


The Core-BB accepts the following commands which are decoded
as out-of-range read instructions on the internal bus. Both the local
CPUs and external modules can issue these commands, the latter via an
external intercommunications bus.

(1) Disable Module - Computers are halted and an output
inhibit is sent to the Bus Controller and/or IO-BBs.

(2) RESTART - Processors are reset and computation begins at
the rollback/restart location.

(3) Enable Module - Release output inhibit to the Bus
Controller and IO-BBs.
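
The interaction between fault signals and the three commands can be sketched as a small state model. This is an illustrative Python sketch only: the class and attribute names are hypothetical, and the rule that a module may be re-enabled only while its processors are running is our simplified reading of "if no additional faults are detected in the intervening period".

```python
class CoreFaultHandler:
    """Toy model of the Core-BB fault-handling behaviour described in the text."""

    def __init__(self, rollback_enabled=False):
        self.rollback_enabled = rollback_enabled  # optional rollback feature
        self.output_inhibit = False               # inhibit to Bus Controller / IO-BBs
        self.processors_running = True

    def internal_fault(self):
        """An internal fault signal arrived from one of the building blocks."""
        self.output_inhibit = True
        self.processors_running = False
        if self.rollback_enabled:
            self.restart()

    def disable_module(self):
        """Disable Module command: halt the computers and inhibit outputs."""
        self.output_inhibit = True
        self.processors_running = False

    def restart(self):
        """RESTART command: resume at the rollback/restart location.

        Outputs stay inhibited until the module is explicitly re-enabled.
        """
        self.processors_running = True

    def enable_module(self):
        """Enable Module command: release the inhibit if the processors are running."""
        if self.processors_running:
            self.output_inhibit = False
        return not self.output_inhibit
```

A new fault during the checkout period stops the processors again (or restarts them when rollback is enabled), so a stale enable request from a halted module has no effect in this model.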

3.5.4 The Bus Interface Building Block (BIBB)

The BIBB can be microprogrammed as a Bus Controller (BC) or
as a Bus Adaptor (BA). The bus system uses MIL-STD-1553A formats as
shown in Figure 3-4, and the BC and BA provide controller and terminal
functions of that standard. The capabilities of the BC and BA are
augmented to provide the following additional functions:

(1) Moving data directly between memories of their host
SCCMs using direct memory access (DMA).

(2) Specification of data to be moved by "name" (using
automatic table look-up in the local SCCM), or by its
internal memory address.

(3) Concurrent detection of message errors and faults
within the BIBB. Communication of fault conditions
to the host SCCM, and disabling the host SCCM under
some fault conditions. Signalling SCCM shutdown via
1553A status messages.

(4) Providing redundant communications paths through the
use of redundant buses.

Since a primary requirement is fault-tolerance, the BIBB is 
designed to detect its own internal faults. Upon detecting such an 
internal fault, the BC and BA behave differently. Bus Controller faults 
are signalled to the Core-BB which disables the host SCCM in order to 
prevent damaged information from propagating throughout the system. 

(A faulty BC can potentially move data to or from the wrong place.)

The Bus Adaptors are redundant, since several buses are
connected to a given SCCM (each through a separate BA). A BA failure
does not prevent its host SCCM from performing correct computations. It
is possible to re-route messages through a different BA and continue
normal operation. Therefore, upon detecting an internal fault, the
hardware of a BA disables its ability to communicate over the external
bus and into the host SCCM. It does not disable the SCCM, and other BAs
can be used to continue communications.

Figure 3-4. MIL-STD-1553A Formats for (a) Words and (b) Messages
(Panel (a) shows the command word, data word and status word formats;
panel (b) shows the controller-to-terminal, terminal-to-controller and
terminal-to-terminal transfer sequences.)

3.5.4.1 BIBB Connections. BIBB connections fall into
several groups as shown in Figure 3-5.

The BIBB-SCCM Interface consists of connections to (a) the
SCCM's internal address bus (AB), (b) the internal data bus (DB), (c) DMA
request and acknowledge (R, AR), (d) an interrupt to the processor
(RUPT), and (e) an internal fault indicator (IF). This interface allows
the BIBB to enter or extract words from the local memory by cycle
stealing, to alert the processor of an error or completion using the
interrupt, and to signal an internal fault.

The Direct Command Interface consists of a set of output
lines (DC) and a strobe signal (ST). In response to a special "direct"
command, a strobe signal is delivered and the output lines can be
divided to activate discrete events.

A set of Configuration Pins are hard-wired to Vcc or ground
to specify the hard names of the BIBB on the 1553A external bus and on
the internal bus (for memory-mapped control).

The External Bus Interface connects with discrete driver and
receiver circuits for the 1553A bus. These connections include data
output lines (HILO, OUTEN), data input lines (INBUS HI, INBUS LO), and
alternate bus selection signals (BSEL, BBSY). A Bus Adaptor is only
connected to a single bus. Therefore, in a BA the bus selection signals
are unused. The data input and output lines are connected to a single
driver/receiver package.

The Bus Controller can communicate over any of several buses.
Therefore, it interfaces with a Controller Interface Module (CIM) which
contains several sets of driver/receiver electronics. We have decided
to place the bus assignment (allocation) logic in the CIM as well. When
the BC starts to initiate a bus communication, it specifies which of
several buses it wishes to use (BSEL). If that bus is in use by a BC
of higher priority a busy signal is returned.

3.5.4.2 Bus Controller. The SCCM requests its Bus Controller to
execute an external bus transfer by "storing" to one of several out-of-range
addresses. Four bits of this address specify which of several
buses to use for the transmission. The data being "stored" specifies
the address of a Bus Control Table (BCT) in memory which specifies the
transmission to be carried out.

The BCT contains a control word, and:

(1) One or two 1553A commands: one for Terminal-to-Controller
or Controller-to-Terminal, or two for
Terminal-to-Terminal transmissions.

(2) The local address (in the BC's host SCCM) from which
data is to be extracted or stored.

The BC initiates and monitors the specified transmission 
and moves data into or out of local memory as required. It places a BC 
status word in a fixed memory location and delivers an interrupt upon 
completion. The BC-status word indicates: 


(1) Transmission Aborted, bus not available.

(2) Transmission Unsuccessful due to coding error or
unreturned status.

(3) Transmission Successful but BA's SCCM has failed.

(4) Transmission OK.

(5) Activity on Requested Bus.


The status words embedded in the 1553A transmission are also 
stored in memory and are available for software reference. 

3.5.4.3 The Bus Adaptor. The Bus Adaptor operates as an "intelligent"
1553A terminal. It is controlled via the intercommunication bus,
and executes 1553A transmit and receive commands. For most commands
received over the bus, the adaptor obtains a data address corresponding
to the 5-bit Subchannel/Mode (S/M) field of the command. The adaptor
then deposits or withdraws words from sequential memory locations by
direct memory access (DMA) to carry out the receipt or transmission of
the specified number of words.

Most values of the S/M field are used as data names. These
values are used as an index into a look-up table in the local memory
which specifies the physical address of the named data. Several values
of the S/M field are reserved for special functions. These include:

(1) Concatenate - continue extracting or depositing data
from internal address used in last transmission.

(2) Designate silent acceptor - directs module to assume
soft name and "listen-in" on subsequent transmission.

(3) Execute direct command - strobe data out on direct
command lines.

(4) Direct addressing - specify absolute local memory
address for next data to be transmitted.
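
The name look-up can be sketched as follows. This is an illustrative Python sketch, not the BA microprogram. The field positions (5-bit terminal address, transmit/receive bit, 5-bit S/M field, 5-bit word count, with a count of zero meaning 32 words) follow the usual MIL-STD-1553 command word layout and should be treated as an assumption here. The numeric codes in `RESERVED` are hypothetical, since the report does not state which S/M values are reserved.

```python
# Hypothetical S/M codes for the special functions (actual values not given).
RESERVED = {
    0b11100: "concatenate",
    0b11101: "designate_silent_acceptor",
    0b11110: "execute_direct_command",
    0b11111: "direct_addressing",
}


def parse_command_word(word):
    """Split the 16 information bits of a command word into its fields."""
    return {
        "terminal": (word >> 11) & 0x1F,
        "transmit": bool((word >> 10) & 1),
        "sm": (word >> 5) & 0x1F,
        "count": (word & 0x1F) or 32,  # assumed: a zero count field means 32 words
    }


def resolve(word, name_table):
    """Map a command word to a DMA target or to a special function.

    name_table maps an S/M data name to the physical address of the named data.
    Returns ("data", address, count) or ("special", function_name, count).
    """
    fields = parse_command_word(word)
    if fields["sm"] in RESERVED:
        return ("special", RESERVED[fields["sm"]], fields["count"])
    return ("data", name_table[fields["sm"]], fields["count"])
```

With a table entry `{7: 0x1200}`, a transmit command naming S/M value 7 with a count of 4 resolves to a 4-word DMA transfer starting at address 0x1200.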

The BIBB, whether programmed as a BC or a BA, also recognizes
several out-of-range addresses as commands to: (1) read out internal
status flip flops and (2) reset itself.


SECTION 4

BUILDING-BLOCK DESCRIPTIONS

The following section provides detailed descriptions of the
major building blocks. An implementation is described for the Memory
Interface, Bus Interface, and Core building blocks. Each building block
is broken into its component internal functions for which preliminary
logic descriptions are provided. This set of descriptions was used to
generate breadboard logic designs.

4.1 THE MEMORY INTERFACE BUILDING BLOCK

The fault-detecting and correcting Memory Interface Building
Block (MIBB) interfaces a redundant set of memory chips to the internal
bus within computer modules. It provides single bit error correction to
damaged memory data, replacement of up to two faulty bit planes with
spares, parity encoding and decoding to the internal bus, and detection
of internal faults.

4.1.1 Memory Interface Building-Block Requirements

Memory is typically among the most significant sources of
failure in computer systems. Due to the simplicity of operation and
a high degree of modularity in organization, the memory system benefits
most from the error-correction techniques. In particular, the application
of the error correction becomes very effective in the case of semiconductor
memories, organized with each bit on separate LSI chips.

The basic goal in the specification and the design of the
memory building block is to provide for a highly reliable and maintainable
memory system by incorporating redundancy in data representation
and logic which allows thorough error detection, and correction of a
majority of single-chip faults.


The fault-tolerance objective is quite simple. Since the
memory represents a preponderance of failure rate within a computer
module (SCCM), single fault correction in memory will greatly improve
the reliability of the SCCM. Even though the SCCM is treated as a
replaceable (throw-away) item with backup spares, improving memory reliability
greatly increases the reliability of each SCCM and of networks
made of these modules.

Specific memory interface requirements are listed below: 

(1) The memory system should have the capability to
correct single errors and to detect double errors in
data words. This can be effectively achieved by
single-error correcting, double-error detecting codes
(SEC/DED codes) for the storage arrays organized
using one-bit-wide memory chips (i.e., each bit of
the word is located on a physically independent
chip, which makes all single faults affect but a single
bit in a word). In order to enhance the applicability
of the memory-interface building blocks, a mode
with parity checking only should be provided.

(2) The memory system should be able to tolerate two
faulty bit-planes in the storage array. A reconfiguration
system should be provided which, upon the
system command, replaces a faulty bit-plane by the
spare one.

(3) Parity encode data outputs for internal (data) bus
transmission.

(4) Check parity of incoming address and data off of the
internal bus.

(5) Recognize Memory Interface Building Block commands as
out-of-range read or write instructions. These
include:

(a) Set Soft Name

(b) Read Error Status Register

(c) Read Error Word Address

(d) Read Error Bit Position

(e) Read Check Bits

(f) Enable/Disable Read Retry

(g) Replace i-th Bit with Spare a/b

(h) Reset i-th Bit Replacement a/b

(i) Enable/Disable Single Error Correction

(6) Data and addresses internal to the building block
shall be maintained and checked with redundant parity
bits to allow internal fault detection.

(7) The coding and control circuits should be self-testing,
fault-secure, or duplicated so that no single circuit
failure will produce an undetected output error.

(8) A self-checked internal fault signal shall be generated
(and sent to the Core building block) when a fault is
detected within the Memory Interface Building Block,
or when an uncorrectable error is found in memory data.

(9) The information about detected errors in the memory
subsystem should be collected and transmitted to the
system upon request, in response to the READ STATUS
command.

4.1.2 Memory Interface Building-Block Design 

The memory system is organized as a random-access memory
(RAM). It consists of up to 16K words of 16 data bits per word. The
basic storage element is a 4K x 1 MOS static-cell chip. This chip also
contains the necessary address decoding circuits, a feature essential
for the error isolation and effectiveness of the error coding. The
memory system operates in the conventional manner. The primary functions
of the memory are to accept data, address and control information,
to store that data in the location as specified by the address, and
retrieve unaltered data information upon demand. The Memory System
consists of two sections: the Storage Array (SA) composed of a set of
commercially available memory chips, and the Memory Interface Building
Block. The Memory Interface Building Block consists of five sub-elements,
designated the Address Bus Interface (ABI), the Error Control
Section (ECS), the Data Bus-Storage Array Interface (DBI), and the
Memory Control Section (MCS), as shown in Figure 4-1. The interface
requirements, commands, structure, operation, and fault-tolerance characteristics
of the storage array and the building block elements are
described in the following paragraphs.

The MIBB is designed to operate in two basic modes. In HAM
mode, the interface provides full error detection and correction capabilities.
In non-HAM mode (the complement of HAM) only detection via two
parity bits is used. The error detection, correction and bit-plane
replacement in this case are performed under the system control. The
address and internal error checking remains the same in both modes. In
the prototype version these modes are selected manually.

The memory size can be specified to be N = 4K, 8K or 16K
words. The size is also selected manually.

Since two spare bit-modules are always provided, the storage
array appears in the following configurations.

(1) In HAM mode:

16 + 6 + 2 = 24 RAM bit-planes of N bits, providing
storage for 16 data bits, 6 check bits and two spare
bits per storage array word.

(2) In non-HAM mode:

16 + 2 + 2 = 20 RAM bit-planes of N bits, providing
storage for 16 data bits, two parity bits and two
spare bits.
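
The two configurations can be checked with a few lines of arithmetic. The Python sketch below is illustrative: the function name and the returned dictionary are hypothetical, and the chips-per-plane figure simply assumes that each bit-plane is built from the 4K x 1 chips mentioned above.

```python
CHIP_WORDS = 4 * 1024  # capacity of one 4K x 1 memory chip


def storage_array(n_words, hamming=True):
    """Count bit-planes and memory chips for a storage array of n_words words."""
    data_planes = 16
    check_planes = 6 if hamming else 2  # SEC/DED check bits, or two parity bits
    spare_planes = 2
    planes = data_planes + check_planes + spare_planes
    chips_per_plane = n_words // CHIP_WORDS
    return {
        "planes": planes,
        "chips_per_plane": chips_per_plane,
        "chips": planes * chips_per_plane,
    }
```

A 16K-word array in the full correction mode therefore needs 24 bit-planes of 4 chips each, or 96 chips; the parity-only mode needs 20 bit-planes.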

4.1.2.1 Memory-System Interface Specification. As indicated in the
general diagram (Figure 4-1), the interface between the storage array
and the system is achieved via the address bus, data bus and a set of
control signals. These buses and control signals are specified in
detail in this section. The address bus and data bus fields are
indicated as follows:



[Diagram: (a) Address Bus Fields (for N = 4K); (b) Data Bus Fields
(for N = 4K), showing the two data bytes and the two parity bits.]

Address Bus

$AB = (AB_0, AB_1, \ldots, AB_{17})$
where $AB_0$ is the most significant bit and

$AB_{16} = AB_0 \oplus AB_2 \oplus \cdots \oplus AB_{14}$

$AB_{17} = AB_1 \oplus AB_3 \oplus \cdots \oplus AB_{15}$

are even and odd byte parity bits.
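
As an illustration, the two parity bits can be generated and checked as below. This Python sketch assumes that the "even" parity bit is the modulo-2 sum of the even-numbered bits and the "odd" parity bit the modulo-2 sum of the odd-numbered bits, with bit 0 the most significant; the function names are hypothetical.

```python
def parity_bits(word16):
    """Return (p_even, p_odd) for a 16-bit value, bit 0 being the most significant."""
    bits = [(word16 >> (15 - i)) & 1 for i in range(16)]  # bits[0] is the MSB
    p_even = 0
    p_odd = 0
    for index, bit in enumerate(bits):
        if index % 2 == 0:
            p_even ^= bit  # bits 0, 2, ..., 14
        else:
            p_odd ^= bit   # bits 1, 3, ..., 15
    return p_even, p_odd


def parity_ok(word16, p_even, p_odd):
    """Check a received 16-bit value against its two received parity bits."""
    return parity_bits(word16) == (p_even, p_odd)
```

Under this assumption, any single-bit error and any burst that flips two adjacent bits changes at least one of the two parity bits.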


The address bus fields are:

Out-of-Range Field:

$$ORF = \begin{cases} (AB_0, AB_1, AB_2, AB_3) & \text{if } N = 4K \\ (AB_0, AB_1, AB_2) & \text{if } N = 8K \\ (AB_0, AB_1) & \text{if } N = 16K \end{cases}$$

Memory Hard Name: defined only if ORF = (1, ..., 1) = ORC

MHN is a group of four address bits if N = 4K, three address bits if
N = 8K, and two address bits if N = 16K. [bit indices illegible]

Memory Soft Name: defined only if ORF ≠ (1, ..., 1)

$$MSN = \begin{cases} (AB_0, AB_1, AB_2, AB_3) & \text{if } N = 4K \\ (AB_0, AB_1, AB_2) & \text{if } N = 8K \\ (AB_0, AB_1) & \text{if } N = 16K \end{cases}$$

Command Operation Code: defined if ORF = (1, ..., 1)

COC [bit indices illegible]

Memory (Word) Address: defined if ORF ≠ (1, ..., 1)

$$MA = \begin{cases} (AB_4, \ldots, AB_{15}) & \text{if } N = 4K \\ (AB_3, \ldots, AB_{15}) & \text{if } N = 8K \\ (AB_2, \ldots, AB_{15}) & \text{if } N = 16K \end{cases}$$


In other words, if ORF bits are all ones then the
COC bits specify a special command which is executed
only by the MIBB with a physical (hard) name matching
the MHN field. If ORF bits are not all ones, the MA
field is used by the MIBB with a logical (soft) name
matching the MSN field.
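
The decision described above can be sketched in Python. The sketch assumes a 16-bit address with bit 0 as the most significant bit, and an out-of-range field of 4, 3 or 2 high-order bits for N = 4K, 8K and 16K, so that the remaining bits form the word address. How the hard name and the operation code share the low-order bits of a command address is not modelled; the function names are hypothetical.

```python
ORF_WIDTH = {4096: 4, 8192: 3, 16384: 2}  # high-order field width per memory size


def decode_address(ab, n_words):
    """Classify a 16-bit address for a memory module of n_words words.

    Returns ("command", low_bits) when the out-of-range field is all ones,
    otherwise ("memory", soft_name, word_address).
    """
    k = ORF_WIDTH[n_words]
    high = ab >> (16 - k)              # AB_0 .. AB_(k-1)
    low = ab & ((1 << (16 - k)) - 1)   # remaining low-order bits
    if high == (1 << k) - 1:
        return ("command", low)        # hard name and operation code live here
    return ("memory", high, low)


def selects_module(ab, n_words, soft_name):
    """True when a memory reference is addressed to the module with soft_name."""
    decoded = decode_address(ab, n_words)
    return decoded[0] == "memory" and decoded[1] == soft_name
```

One consequence visible in the sketch is that the all-ones soft name can never be assigned to a module, because that bit pattern is reserved for out-of-range commands.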

Data Bus

$DB = (DB_0, DB_1, \ldots, DB_{17})$
where $DB_0$ is the most significant bit.

Parity Bits:

$DB_{16} = DB_0 \oplus DB_2 \oplus \cdots \oplus DB_{14}$

$DB_{17} = DB_1 \oplus DB_3 \oplus \cdots \oplus DB_{15}$

Data bytes: $(DB_0, \ldots, DB_{15})$

Soft Name Field:

$$SNF = \begin{cases} (DB_{12}, DB_{13}, DB_{14}, DB_{15}) & \text{if } N = 4K \\ (DB_{12}, DB_{13}, DB_{14}, 0) & \text{if } N = 8K \\ (DB_{12}, DB_{13}, 0, 0) & \text{if } N = 16K \end{cases}$$

This field specifies the logical name to be assigned
to a memory by executing an SSN (Set Soft Name)
command.

Bit Replacement Position Field:

BRP [bit indices illegible]

This field specifies the position of the bit-plane to
be replaced by a spare.




Storage Array Interface Signals

Memory Address:

$(MAR_4, \ldots, MAR_{15})$ if N = 4K
$(MAR_3, \ldots, MAR_{15})$ if N = 8K
$(MAR_2, \ldots, MAR_{15})$ if N = 16K

Memory Word (Bit plane I/O):

$BP = (BP_d, BP_c)$

Memory Data Bits:

$BP_d = (BP_0, \ldots, BP_{15})$

Memory Check Bits:

$BP_c = (BP_{16}, \ldots, BP_{21})$

Spare Bit Plane Data:

$SP_a$, $SP_b$

Read/Write:

WRITE, NWRITE

Control Signals

Memory Start:

MSTART (a 1-0 transition activates MIBB)

Memory Completion:

MCOMPL (a 1-0 transition indicates
completion of an MIBB operation)


Read/Write:

RW = (WRITE, NWRITE)
   = (1,0) if write
   = (0,1) if read
   = (0,0) if no read/write or error
   = (1,1) if error

System Reset:

RESET

Memory Error Interrupt:

MINT = (0,1) ∨ (1,0) = 1_M - no uncorrectable memory error
     = (0,0) ∨ (1,1) = 0_M - uncorrectable memory error or
                             MINT circuit error

Single Error Correction Indicator:

SECI = 1_M - no single errors corrected
     = 0_M - single error corrected or SECI circuit error

Clock Inputs: (optional)

φ1, φ2 - standard system clocks
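
The two-rail ("morphic") signal convention used for RW, MINT and SECI can be sketched as follows. This Python sketch is illustrative; the function names are hypothetical, and the string labels returned for the read/write pair are our own.

```python
def morphic(pair):
    """Map a two-rail pair to True (morphic 1) or False (morphic 0).

    Complementary rails, (0,1) or (1,0), form a valid code value; equal
    rails, (0,0) or (1,1), signal either the error condition itself or a
    fault in the circuit that produces the signal.
    """
    a, b = pair
    return (a ^ b) == 1


def rw_command(write, nwrite):
    """Interpret the (WRITE, NWRITE) pair as defined for RW."""
    if (write, nwrite) == (1, 0):
        return "write"
    if (write, nwrite) == (0, 1):
        return "read"
    if (write, nwrite) == (0, 0):
        return "idle_or_error"
    return "error"
```

The point of the encoding is that a single stuck line can only turn a valid pair into an invalid one, so a failure of the indicator circuit itself is reported as an error rather than silently masked.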


4.1.2.2 Specification of MIBB Operations and Commands. The commands
interpreted by the MIBB are specified here as control sequences at the
register-transfer (microprogramming) level in the context of the MIBB
design described later in detail.




A general view of the MIBB operational states and the flow 
of control is indicated in Figure 

In describing commands, the following notation is used:

(1) All statements are labeled; simultaneous
register transfers are separated by ";".

(2) "←" indicates register-transfer (assignment);

(3) "→ label" indicates unconditional branch in
control sequence;

(4) (A,B) denotes concatenation of registers A and
B;

(5) All functions are implemented with combinational
networks; the arguments, enclosed in ( ), are
bit-vectors;

(6) For greater readability, all conditional constructs
are in the form if ... then ... else ...,
or if ... then ...

(7) Braces "{,}" are used to enclose clauses in
conditional statements.


The operations of the MIBB are specified by the following
algorithms:

C Initialization
INIT: if POWER ON or RESET then
      {ESR ← 0;               C Clear error status register
       [illegible]            C Clear internal error flags
       MINT ← 1_M;            C Clear MIBB interrupt flag
       SECI ← 1_M;            C Clear SEC flag
       EBAR ← 1;              C Set error bit-position to out-of-range value
       [illegible] ← 0;       C Clear read retry flag
       RTRONE ← 0;            C Clear retry count
       [illegible] ← 0;       C Clear first error flag
       OUTEN ← 0;             C Disable MIBB outputs to DB
       SECEN ← 1;             C Enable SEC scheme
       [illegible]            C Clear READ/WRITE command latch
       POSAR_a, POSAR_b ← 1;  C Set bit replacement address out of range
       MCOMPL ← 1;            C Memory ready
       → WAIT}

C P is 2-bit parity register; ORCR is out-of-range command register.

WAIT: if FORME·READ·MSTART then
      {MAR ← AB; → READ1}
      if FORME·WRITE·MSTART then
      {MAR ← AB; MDR ← DB;
       P ← [illegible]; → WRITE1}
      if ORC·MSTART then
      {ORCR ← COC; → DECODE}

C Out-of-range command decoding
DECODE: → (decoded command)

READ Operation

C APC is address parity check; DPC is data parity check; SNC is
  soft name check;

C EWAR is error word address register which is continuously loaded
  with current address until first error; it is cleared on readout;

C EBAR is error bit position register;

C PAR, EBA and COR are the correction functions;

C d and c are data and check bits from memory bit planes.

READ1: if HAM then {MDR ← (d,c); P ← PAR(d)};
       if ¬HAM then {MDR ← d; P ← (c_0, c_1)};

       ESR_0,5,6 ← APC_0 ⊙ APC_1, SNC_0 ⊙ SNC_1, WRITE ⊙ NWRITE
       C save Address Parity Check, Soft Name Check and Read/
         Write Check status

       E1 ← RED(APC, SNC, RW)
       C reduce and save for internal use

       if ¬FRSTEW then {EWAR ← MAR}
       C store address for diagnosis

       → READ2

READ2: ESR_1,2,3,4 ← DPC_0 ⊙ DPC_1, SE_0 ⊕ SE_1, DE_0 ⊕ DE_1, NCE_0 ⊙ NCE_1
       C save Data Parity Check, Single Error, Double Error and
         No Circuit Error

       E2 ← DPC;

       E4 ← RED(SE, DE, NCE);
       C reduce and save

       OUTEN ← 1;    C Enable MIBB data output;
                       Disable on ¬MSTART

       → READ3

READ3: if [condition illegible] then
       {MCOMPL ← 1; → WAIT}    C No memory error; completion OK
       if HAM · SECEN · SER · NCER · RTRY then
       → READ4                  C Single Error correction
       if HAM · SECEN · SER · NCER · RTRY · RTRONE then
       {RW ← (0,1); MDR ← [illegible](w);
                                C read retry
       if [condition illegible] then
       {MINT ← 0_M; MCOMPL ← 1; FRSTEW ← 1;
       → WAIT}                  C uncorrectable memory error
•*• UAIT $ uncorrecc^le nmnory error 
READ4: MDR ← COR(w);   C Single error correction
       EBAR ← EBA;     C Save bit position
       SECI ← 0_M;     C Set SEC flag
       → READ5

READ5: P ← PAR(w);     C Compute parities
       MCOMPL ← 1;     C Data sent out
       → WAIT

READ6: RTRONE ← 1;     C one read retry
       BP ← w;         C write back to memory
       RW ← (1,0);     C switch back to read
       ESR ← 0;        C reset flags
       SECI ← 1_M;
       MINT ← 1_M;
       → READ1

WRITE Operation

C SYN is the syndrome generation function
C w = (MDR_0, ..., MDR_15)
C No checking of single and double errors performed

WRITE1: if HAM then {MCR ← SYN(w)};
        C generate syndromes
        ESR_0,1,5,6 ← APC_0 ⊙ APC_1, DPC_0 ⊙ DPC_1, SNC_0 ⊙ SNC_1,
                      NWRITE ⊙ WRITE;

It is assumed that the relevant control signals, generated
by duplicated controllers, are compared in each step using a morphic
comparator. If an error is detected, the memory operation is terminated
after setting control error status bit ESR_7 and memory interrupt
indicator MINT.



I 


Ite — fuj m Ttii **init nf rm|i" 

Set Soft Name (SSN)

SSN1: SNR ← 1;          C Duplicated soft name register
                        C Set SNR register to all 1's to check load signals

SSN2: SNR ← SNF;        C Load soft name

SSN3: if [condition illegible] then     C Soft name check error
      {ESR_5 ← 1;       C Set status bit
       MINT ← 0_M;      C Interrupt
       MCOMPL ← 1}      C Terminate



Read Error Status Register (RES)

RES1: OUTEN ← 1;        C Set connect flag
                          (reset on MSTART going low)
      MCOMPL ← 1;       C Connect ESR to DB
                          (transmit error status)

Read Error Word Address (REA)

REA1: OUTEN ← 1;        C Set connect flag
                          (reset on MSTART going low)
      [illegible] ← 0;
      [illegible]
      → WAIT

Read Error Bit Position (REP)

REP1: OUTEN ← 1;
      MCOMPL ← 1;       C Connect EBAR to DB

Read Check Bits (RCB)

RCB1: OUTEN ← 1;
      MCR ← c;
      MCOMPL ← 1;       C Connect MCR to DB

Enable/Disable Read Retry (EDR)

EDR1: RTRY ← AB_11;     C AB_11 = 1 to enable,
      MCOMPL ← 1;         AB_11 = 0 to disable
      → WAIT              retry

Replace i-th Bit with Spare a/b (RBP)

RBP1: if AB_11 then
      {POSAR_a ← BRP;}
      if ¬AB_11 then
      {POSAR_b ← BRP;}
      MCOMPL ← 1;
      → WAIT

Reset i-th Bit Replacement a/b (RBR)

RBR1: if AB_11 then
      {POSAR_a ← 1;}    C All 1's indicate a non-existent
                          bit plane.
      if ¬AB_11 then
      {POSAR_b ← 1;}
      MCOMPL ← 1;
      → WAIT

Enable/Disable Single Error Correction (SEC)

SEC1: SECEN ← AB_11;    C AB_11 = 1 to enable,
      MCOMPL ← 1;         AB_11 = 0 to disable
      → WAIT              single error correction


4.1.3 Error Control Capabilities

The address, data and commands are systematically checked
against single and double errors using appropriate encoding schemes
(byte-parity, morphic and an SEC/DED code) and self-checking checkers.
The information about code errors and circuit faults is collected during
each memory operation cycle and saved in the error status register ESR
as follows:

ESR_0 - Address Parity Error
ESR_1 - Data Parity Error
ESR_2 - Single Error
ESR_3 - Double Error
ESR_4 - Circuit Error
ESR_5 - Soft Name Decoding Error
ESR_6 - Read/Write Command Error
ESR_7 - Control Error

The error checking capabilities of the MIBB are specified in more detail
next.

Address Parity Checking

APC = (APC_0, APC_1)

APC = (0,1) ∨ (1,0) = 1_M if no parity errors in MAR
    = (0,0) ∨ (1,1) = 0_M if parity incorrect or checker fault

Action:

if (APC = 0_M) ∧ (MCOMPL = 0) ∧ CS_i then
   MINT ← 0_M; ESR_0 ← 1; MCOMPL ← 1

- no operation on the storage array performed

- CS_i is a control state.

Data Parity Checking

DPC = 1_M if no parity errors in MDR
    = 0_M if parity incorrect or checker fault

Action:

if (DPC = 0_M) ∧ (MCOMPL = 0) then
   if WRITE ∧ CS_i then
      MINT ← 0_M; ESR_1 ← 1; MCOMPL ← 1
   else if READ ∧ CS_j then
      MINT ← 0_M; ESR_1 ← 1

- write operation not performed on the storage
  array.

Data SEC/DED Checking

An odd-weight separable single error correcting and double
error detecting (SEC/DED) code is used to encode 16-bit
data words on 22-bit memory words [CART 76]. The SEC/DED
code is specified in Table 4-1.

A memory word consists of 16 data bits followed by six check
bits.

Table 4-1. Odd-Weighted SEC/DED Code
[table contents illegible]

Each check bit $C_i$, $i = 0, \ldots, 5$, is formed over a data group
$DG_i$ of eight data bits of the MDR. The legible index sets are:

$C_2$: MDR bits (0, ?, ?, 5, 8, 10, ?, 15)
$C_3$: MDR bits (0, 3, 8, ?, 11, 12, 13, 15)
$C_5$: MDR bits (?, 2, 9, 10, 12, 13, 14, 15)

The check bit is chosen so that $(C_i, DG_i)$ has odd parity.

The syndromes $S_0, \ldots, S_5$ are defined as

$S_i = \oplus / (C_i, DG_i)$

so that $S_i = 1$ if there is no single error in $(C_i, DG_i)$, and
$S_i = 0$ otherwise.

The analysis of syndromes is implemented with morphic logic
in the following cases:

(1) Single error:

if $\oplus / (S_0, \ldots, S_5) = 1$ then SE = 1_M

i.e., an odd (actually, 3 or 1) number of syndromes
with the value 1 indicates the single error case.

(2) Double error:

if $\oplus / (S_0, \ldots, S_5) = 0$ and not all $S_i = 1$
then DE = 1_M

(i.e., a double error is indicated by an even number
(< 6) of syndromes having the value 1).

(3) No single or double error:

The no error case is indicated when all syndromes have
the value 1.


The read/write command is checked by morphic logic and an error
causes ESR_6 ← 1, MINT ← 0_M and no action on the storage array.
The out-of-range commands are executed using two microprogrammed
control units, checked with morphic comparators
at the control signal outputs. In case of discrepancy,
ESR_7 ← 1, MINT ← 0_M and the operation is terminated. The
memory name decoding is checked by duplication and morphic
comparators. All checker circuits are checked using morphic
logic against single errors.
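
The odd-weight-column principle behind the SEC/DED check can be demonstrated with a small Python sketch. It does not reproduce Table 4-1: the parity-check columns below are an arbitrary illustrative choice (16 distinct weight-3 columns for the data bits, 6 weight-1 columns for the check bits), and the sketch uses the conventional all-zero syndrome for "no error", whereas the report's syndrome bits appear to use the opposite sense (all ones for no error).

```python
from itertools import combinations

# Illustrative parity-check columns (NOT the report's Table 4-1).
DATA_COLS = [sum(1 << r for r in rows) for rows in combinations(range(6), 3)][:16]
CHECK_COLS = [1 << r for r in range(6)]
COLS = DATA_COLS + CHECK_COLS  # column j belongs to bit j of the 22-bit word


def encode(data_bits):
    """Append six check bits to a list of 16 data bits (0/1 integers)."""
    check = 0
    for bit, col in zip(data_bits, DATA_COLS):
        if bit:
            check ^= col
    return list(data_bits) + [(check >> r) & 1 for r in range(6)]


def syndrome(word):
    """XOR together the columns of all bits that are 1 in the 22-bit word."""
    s = 0
    for bit, col in zip(word, COLS):
        if bit:
            s ^= col
    return s


def decode(word):
    """Return (status, word) with status 'ok', 'single' or 'uncorrectable'."""
    s = syndrome(word)
    if s == 0:
        return "ok", list(word)
    if bin(s).count("1") % 2 == 1 and s in COLS:
        fixed = list(word)
        fixed[COLS.index(s)] ^= 1   # odd-weight syndrome names the faulty bit
        return "single", fixed
    return "uncorrectable", list(word)  # even weight: double error detected
```

Because every column has odd weight, the sum of any two columns has even weight, which is why a single parity over the syndrome bits is enough to separate single from double errors.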


4.1.4 Design of Memory Interface Building Block

As indicated in Section 4.1.2, the memory system consists of
two sections: the Storage Array (SA) composed of a set of commercially
available memory chips, and the Memory Interface Building Block.

The Storage Array consists of up to 22 active bit-planes,
denoted BP_i, i = 0, 1, ..., 21, which are used for storing 16 data bits
and six check bits. The check bits are defined by a modified Hamming
SEC/DED code for which relatively efficient implementation with good
coverage can be specified. There are two spare bit-planes, SP_a and
SP_b.

All bit-planes are identical and contain up to 4 (4K x 1)
basic memory chips with on-chip decoding. The reconfiguration is performed
by replacing a faulty bit-plane using a direct spare demultiplexing
replacement scheme, as described later.

The Memory Interface Building Block is subdivided in four
sections (see Figure 4-1) which are described in detail in the following
paragraphs.

4.1.4.1 Address Bus Interface. The Address Bus Interface (ABI)
section, which provides the address parity checking and decoding required
to select a memory module, is shown in Figure 4-3. Specifically, it
receives a 16-bit memory address encoded with two byte-parity bits from
the Address Bus and stores it into the Memory Address Register (MAR).
The self-checking parity checker circuit (APC) is used to validate the
address before a read or write operation is performed. If no errors are
detected, the low-order 12 bits are sent to the Storage Array Block
where the independent, on-chip decoding is performed. A fault in one
on-chip decoder may cause access to a wrong location to occur, but this
will be detected and corrected by the data-word SEC/DED code. Similarly,
two decoding errors will be detected by the SEC/DED scheme. No distinction
is made between errors caused by faults in on-chip decoders or
storage cells.

The decoding of the high-order (0-2) address bits, which are
used to select a module within the Storage Array, is checked by a self-testing
decoder. Alternately, a separate decoder can be associated with
each bit-plane, thus making it possible to use the data-word error code
for correction of single bit errors and detection of double bit errors
in the address decoding. The high-order, module select bits are used
as "soft" names and must be mapped into the physical module address.

The design of the Soft Name Checker (SNC) is given in
Figure 4-4. The Address Parity Checker and the 5-input morphic comparator
are shown in Figures 4-5 and 4-6. The Error Word Address Register
(EWAR) is used to store the address currently being referenced. If a
fault should occur the EWAR can be read out for subsequent diagnosis.
The block labeled ORC detects out of range commands to the MIBB. It is
shown in Figure 4-6.


Figure 4-3. Address Bus Interface

4.1.4.2 Data Bus - Storage Array Interface. The Data Bus - Storage
Array Interface (DBSA) consists of Bit Modules (DBM, CBM) and the
Replacement Control Section (RCS) [figure reference illegible].

A Data Bit Module (DBM) is 8-bit wide. It consists of a portion of the
Memory Data Register (MDR) and modules for interfacing MDR with regular
and spare bit-planes. The designs of subblocks are indicated in
Figures 4-8 to 4-13. The replacement of a faulty bit-plane is done by
decoding replacement registers (POSA, POSB) (Figure 4-14). The
decision which bit-plane to replace is made by the system. On the basis
of error information (location of last faulty bit), the system sends the
corresponding RSP command and loads POSA (POSB) with the bit-plane
position code. A correction input is used to allow the error correction
subsystem to complement an erroneous bit. The concurrence of POSA and
[signal name illegible] causes the specified bit to be replaced by spare
A in the i-th DBM or CBM. Similarly, [signal name illegible] enables
replacement of the internal bit specified by POSB using spare plane B.
The signal [name illegible] enables correction (inversion) of the bit
specified by [name illegible] within the selected DBM (or CBM). The
signals S [subscripts illegible] are the Hamming code syndromes to be
inserted in the check bits during store operations.


4.1.4.3 Error Control Section. The Error Control Section (ECS),
shown in Figure 4-15, is responsible for generating Hamming code check
bits and syndromes (SGC) (see Figure 4-17), byte-parity generation and
checking (DPCG) (see Figure 4-16), and error analysis (SDA). The circuits
used in the ECS block are also self-testing. A single bit error is
corrected by decoding the syndrome generated from the word contained in
the Memory Data Register (MDR) in order to localize the faulty bit i;
the correction is performed by reloading MDR_i with the faulty bit
complemented. The correction mechanism can be disabled on system request
to preserve the data information for systems diagnostics. The
byte-parity checking provides for detection of the most frequent errors
in the bus and interface circuits.










Figure 4-10. Memory Data Register - Check Bit Module











Figure 4-11. Bit Interface Module (3X) 



Figure 4-12. Bit-Plane Interface Module (2X) 








Figure 4-15. Error Control Section (ECS)
















REPEAT FOR          INPUT

[pair illegible]    MDR (5,6,7,8,9,10,11,12,17)
(S_20, S_21)        MDR (0,2,4,5,8,10,14,15,18)
(S_30, S_31)        MDR (0,[illegible],11,12,13,15,19)
(S_40, S_41)        MDR (1,3,4,7,9,[illegible],14,20)
(S_50, S_51)        MDR (1,2,9,10,12,13,14,15,21)

EQV GATES: ~150 (TOTAL)

Figure 4-17. Syndrome Generators/Checkers (SGC) 6X


The error analyzer receives the inputs from the following 
functions: data-word error coding; data-word byte-parity checking; 

address-word byte-parity checking; all self-testing circuits and check- 
ers of duplicated units. The output signals indicate the conditions, 
such as NE (no error), SE (single error), DE (double error), CE (circuit 
error), and they are recorded in the Error Status Register (ESR) which 
can be transmitted over the Data Bus on system demand. The specifica- 
tion of the fields and information to be recorded in ESR should enhance 
the systems diagnostics and maintainability of the memory system. 

The design of ECS follows that of Carter et al. 



The morphic XOR trees are used in checking and generating
check bits as follows:

In READ operation, the output $S_i$ represents the i-th
syndrome

$S_i = \bigoplus (C_i, DG_i)$

where $C_i$ is the i-th check bit and $DG_i$ represents 8 MDR
positions as defined on the diagram. The signals $S_{i0}$, $S_{i1}$
are morphic outputs for the i-th syndrome. By
definition, $S_i = 1$ if there are no single errors in
the positions corresponding to $C_i$ and $DG_i$.

The Carter SEC/DED analyzer, shown in Figure 4-18, performs
the checking of syndrome generation by the morphic signal SGC, which is
a morphic 1 if there is no error in any of the syndrome generators and a
morphic 0 otherwise. This is so because odd parity is used in the
encoding. Two parity trees are used to produce a morphic syndrome parity
check (SPC). Since there is an even number of syndromes and parity bits
and the syndrome "no error" condition is a morphic 1, there should be an
even number of 1's in total under the no error condition. Therefore, both
parity trees should have like parity, and SPC is a morphic 0 under a no
error condition and a morphic 1 otherwise.

From the morphic signals SGC and SPC it can be decided when
there is no syndrome error (NE), a double error (DE) or a single error
(SE). These conditions are mutually exclusive, and that fact can be used
to provide for checking the analyzer circuits, as indicated by the No
Circuit Error (NCE) network.
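
The NE/SE/DE decision can be illustrated with a minimal Python sketch. It is not the circuit of Figures 4-17 and 4-18: the group assignment `COLS` is hypothetical (each data bit is placed in three of the six check groups), and the function names `encode`, `syndromes`, `classify` and `correct` are invented for the illustration. Under those assumptions a single error disturbs an odd number of syndromes and a double error an even, nonzero number, which is the parity argument made above.

```python
from itertools import combinations

N_DATA, N_CHECK = 16, 6

# Hypothetical group assignment (NOT the one in Figure 4-17): data bit k
# belongs to the three check groups listed in COLS[k].
COLS = list(combinations(range(N_CHECK), 3))[:N_DATA]


def encode(data):
    """Append 6 check bits so each group (data members + check bit) has odd parity."""
    checks = []
    for g in range(N_CHECK):
        ones = sum(data[k] for k in range(N_DATA) if g in COLS[k])
        checks.append((ones + 1) % 2)
    return list(data) + checks


def syndromes(word):
    """S_g = XOR of check bit g and its data group; 1 means 'no error'."""
    result = []
    for g in range(N_CHECK):
        s = word[N_DATA + g]
        for k in range(N_DATA):
            if g in COLS[k]:
                s ^= word[k]
        result.append(s)
    return result


def classify(word):
    """Return (condition, position): NE, SE with the faulty bit index, or DE."""
    failed = tuple(g for g, s in enumerate(syndromes(word)) if s == 0)
    if not failed:
        return "NE", None
    if len(failed) == 1:            # only one group disturbed: its check bit
        return "SE", N_DATA + failed[0]
    if failed in COLS:              # three groups disturbed: one data bit
        return "SE", COLS.index(failed)
    return "DE", None               # everything else is uncorrectable here


def correct(word):
    """Complement the faulty bit when a single error is located."""
    condition, position = classify(word)
    fixed = list(word)
    if condition == "SE":
        fixed[position] ^= 1
    return condition, fixed
```

In this sketch any failure pattern that does not match a single-bit signature is reported as DE, which also covers some multiple errors that a real analyzer might distinguish.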

The morphic error indication signals are systematically
collected in an 8-bit error status register (ESR) (see Figure 4-19).

On the basis of address and data parity checking, SEC/DED analyzer
outputs and command/control checking, two outgoing signals are formed.
Whenever a single error has been corrected, a morphic interrupt signal
SECI is generated. If an uncorrectable condition exists, the memory
error interrupt (MINT) is generated. If the MINT condition exists, a write
operation is prevented.






Figure 4-18. SEC/DED Analyzer (SDA)

Figure 4-19. Error Status Register and Memory Interrupt (ESR/MI)











4.1.4.4 Memory Control Section. The Memory Control Section (MCS)
provides control signals required to implement operation and command
algorithms. As indicated in Figure 4-20a, the MCS consists of the
following subsections: the Control Interface (CI), the Clock Generator
(CPG), two Condition Generators (KG), two State Sequencers (SS),
two Control Signal Generators (CSG), and a Control Signal
Comparator (CSR). These subsections are described in more detail in
the following paragraphs.

(a) Control Interface (CI) and Clock Generator (CPG)

The Control Interface (CI) is shown in Figure 4-20b. It
consists of SCCM-MIBB handshaking circuits (MSTART-MCOMPLETE
circuits), several flags, and the out-of-range command
register with the command decoder. The Clock Generator,
also shown in Figure 4-20b, consists of the basic 8 MHz
clock oscillator and a synchronizing divider which produces a
4 MHz clock train in automatic mode when MSTART = 1. In the
manual mode, a single edge is produced. (It is assumed
that all flip-flops are edge-triggered.)

(b) Condition Generator (KG)

The conditions generated by KG are defined below. (Complement bars
and several subscripts are illegible in the source.)

K_1 = HAM·SER·SECEN·NCER·RTRY
K_2 = HAM·SER·SECEN·NCER·RTRY·RTRONE
K_3 = HAM·ERR·SER + HAM·ERR + RTRY·RTRONE·ERR
K_4 = RES + REA + REP + EDR + RSP + RBR + SEC
K_5 = ORC_a·MSTART
K_6 = NWRITE + SSN
K_[?] = NWRITE·K_[?] + SSN
K_[?] = FORME·MSTART·(WRITE + NWRITE)
K_[?] = (K_[?] + K_[?] + K_2)·NWRITE
K_[?] = NWRITE + WRITE

The implementation is straightforward and is not shown here.

(c) State Sequencer (SS)

The State Sequencer (SS) implements the control state diagram
shown in Figure 4-20c. The t states correspond to the steps
of the operation and command algorithms given before, as
follows:





Figure 4-20a. Control Section








Figure 4-20b. Control Interface and Clock Generator (C1&C6) 













t_0    WAIT
t_1    DECODE
t_2*   READ1, RCB1
t_3    READ2, SSN1, RES1, REA1, REP1, EDR1, RSP1, RBR1, SEC1
t_4    READ3, SSN2
t_5    READ4, SSN3
t_6*   READ5, WRITE1
t_7    READ6, WRITE2

The states t_2 and t_6 take two (2) clock periods in order to
accommodate the access to the storage array. The implementation
for the breadboard uses a standard synchronous counter
plus multiplexer approach. In a VLSI implementation
it is likely that an asynchronous sequencer would be more
appropriate. The state transitions of the counter T,
shown in Figure 4-20d, are controlled by the following
functions:

TCOUNT = t_0·K_5 + t_1·RCB + t_2·NWRITE + t_3·K_[?] + t_4·K_[?] + t_5·NWRITE + t_6·WRITE

TLOAD = t_0·K_8 + t_2*·RCB + t_3·K_9 + t_4·K_10 + t_5·SSN + t_6·NWRITE + t_7·K_11

TCLEAR = PR + TQ_3 (i.e., for all t_i, i ≥ 8).

The parallel load inputs are defined as follows:

[Table with columns Next State, Present State, and Condition. Legible
rows: present state t_0, condition NWRITE, next state 0010; present
state t_0, condition WRITE, next state 0110. The remaining rows and the
parallel-input equations derived from the table are illegible in the
source.]

Figure 4-20d. State Sequencer (SS)

The sequencer is shown in Figure 4-20d. For additional
flexibility in the breadboarding phase, we use a Mod-16
counter and 16-to-1 multiplexers even though a Mod-8 counter
and 8-to-1 multiplexers would be sufficient.

(d) Control Signal Generator (CSG)

The control signals for register and selection networks are
defined below. Again, the implementation is simple and it
is not shown here. Since there is a large number of control
signals (approx. 60), direct morphic reduction would be too
costly. However, it is possible to group together (by ORing)
mutually exclusive signals before reduction. For the
breadboarding phase, a direct signal-to-signal comparison on
equivalence is preferred. A control error is indicated if
not all comparisons are the same.

(In the equations below, complement bars and subscripts shown as [?] are
illegible in the source.)

EBAR register
    SETEB = PR + t_7·NWRITE + MSTART·MCOMPL
    LOADEB = t_[?]·NWRITE

EWAR register
    LOADEW = (t_2*·NWRITE + t_6*·WRITE)·FRSTEW

FRSTEW flag
    CLEARFRSTEW = PR + t_3·REA
    SETFRSTEW = t_4·NWRITE·K_3 + t_7·WRITE·ERR

E register (morphic)
    SETE = PR + t_7·NWRITE
    LOADE1 = t_2*·NWRITE + t_6*·WRITE
    LOADE2 = t_3·NWRITE + t_6*·WRITE
    LOADE3 = [illegible]
    LOADE4 = t_[?]·NWRITE



ESR register
    CLEARESR = PR + t_[?]·NWRITE + MSTART·MCOMPL
    LOADESR_0 = t_2*·NWRITE + t_6*·WRITE
    LOADESR_1 = t_[?]·NWRITE + t_[?]·WRITE
    LOADESR_2 = t_[?]·NWRITE + t_[?]·WRITE
    LOADESR_3 = LOADESR_2
    LOADESR_4 = LOADESR_[?]
    LOADESR_5 = LOADESR_[?] + t_[?]·SSN
    LOADESR_6 = LOADESR_0
    LOADESR_7 = [illegible]

MCOMPL flag
    SETMCOMPL = PR + t_[?]·NWRITE·K_[?] + t_[?]·NWRITE + t_[?]·WRITE
                + t_[?]·SSN + t_[?]·K_[?] + t_2*·RCB
    CLEARCOMPL = ↑MSTART·(FORME + ORC_a)

MAR register
    LOADMAR = t_0·(FORME + ORC_a·RCB)·↑MSTART

MDR register
    LOADMDR = t_0·FORME·MSTART·WRITE + (t_2* + t_[?]·K_2)·NWRITE

MCR register
    LOADMCR = t_6*·WRITE·HAM + t_2*·RCB

OUTEN flag
    CLEAROUTEN = PR + MSTART·MCOMPL
    SETOUTEN = t_3·(NWRITE + K_[?]) + t_2·RCB

RTRONE flag
    CLEARRTRONE = PR + MSTART·MCOMPL
    SETRTRONE = t_[?]·NWRITE






RTRY flag
    CLEARRTRY = PR
    LOADRTRY = t_3·EDR

ORCR register
    LOADORCR = t_0·ORC_a·MSTART

POSAR register
    SETPOS_A = PR + t_3·RBR·AB_11
    SETPOS_B = PR + t_3·RBR·AB_11
    LOADA = t_3·RSP·AB_11
    LOADB = t_3·RSP·AB_11

P register
    LOADP = t_0·FORME·MSTART·WRITE + (t_2* + t_[?]*)·NWRITE

RW register
    SETRW = [illegible]·NWRITE
    LOADRW = t_[?]·FORME·MSTART
    CLEARRW = MSTART·MCOMPL
    RESETRW = [illegible]

SECI register
    SETS = PR + t_[?]·NWRITE + MSTART·MCOMPL
    LOADS = t_[?]·NWRITE

MINT register
    SETM = PR + t_7 + MSTART·MCOMPL
    LOADM = t_[?]·K_[?] + t_[?] + t_[?]·SSN

SECEN flag
    SETSECEN = PR
    LOADSECEN = t_3·SEC

SNR register
    SETSNR = t_[?]·SSN
    LOADSNR = t_[?]·SSN


MOUT signals
    OUT[illegible] = OUTEN·REA·MCOMPL
    OUTESR = OUTEN·RES·MCOMPL
    OUTMCR = OUTEN·RCB·MCOMPL
    OUTMDR = OUTEN·NWRITE·MCOMPL
    OUT[illegible] = OUTEN·REP·MCOMPL

SELDW selection
    SELDW = t_2*·NWRITE

MEMEN (memory enable)
    MEMEN = [illegible]

ERR
    ERR = NMERR_0 ⊕ NMERR_1
    SER = SE_0 ⊕ SE_1
    NCER = NCE_0 ⊕ NCE_1
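
The counter-plus-multiplexer organization of the State Sequencer described in item (c) can be sketched in Python as follows. This is an illustration only: the priority of clear over load over count is an assumption borrowed from common synchronous counter parts, the two-clock duration of t_2 and t_6 is ignored, and the read and write paths (t_0 to t_2 on a read, t_0 to t_6 on a write) are taken from the legible rows of the parallel load table. The names `next_state` and `trace` are invented for the sketch.

```python
def next_state(state, tcount=False, tload=False, tclear=False, load_value=0):
    """One clock edge of a 4-bit synchronous counter with parallel load.

    Assumed priority: TCLEAR over TLOAD over TCOUNT; otherwise hold.
    """
    if tclear:
        return 0
    if tload:
        return load_value & 0xF
    if tcount:
        return (state + 1) & 0xF
    return state


def trace(write):
    """State sequence of one memory operation, starting and ending in t0."""
    state, seen = 0, [0]
    # From t0 the counter is loaded: t2 (READ1) for a read, t6 (WRITE1) for a write.
    state = next_state(state, tload=True, load_value=6 if write else 2)
    seen.append(state)
    # The counter then steps through the remaining states up to t7.
    while state != 7:
        state = next_state(state, tcount=True)
        seen.append(state)
    # At t7 the counter is loaded with 0, returning to WAIT.
    state = next_state(state, tload=True, load_value=0)
    seen.append(state)
    return seen
```

With these assumptions `trace(False)` visits t0, t2 through t7, and t0 again, while `trace(True)` visits t0, t6, t7, t0.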


4.1.5 Estimated Complexity of Implementation

The design of the MIBB was directed toward partitioning
into LSI modules of 300-730 gates per module. This has been largely
achieved, as summarized in Table 4-2. It is also small enough to be
implemented on a single VLSI circuit.

The breadboard realization using SSI/MSI modules requires
about 200 chips.



Table 4-2. Component Count

Module             Equivalent Gates    I/O Pins    LSI Chips

ABI                ~463                ~64         1
ECS                ~650                ~100        1
DBSA               ~775                ~64         1
MCS (Duplicate)    ~500                ~64         1

Total              ~3000                           [illegible]


4.2 THE CORE BUILDING BLOCK


The Core Building Block (Core-BB) is responsible for
(1) detecting CPU and bus faults, (2) collecting fault indications from
the other building blocks, and (3) disabling its computer module upon
the detection of a permanent fault. Two fault-handling options are
provided by the Core-BB:

(1) Stop at the first fault indication;

(2) Rollback at first fault indication, stop if fault
recurs.

4.2.1 Core Building Block Requirements 

Specific requirements of the Core Building Block are listed
below:

(1) Compare two CPU's for disagreement; 

(2) Parity encode CPU output for internal bus transmission; 

(3) Check parity on internal bus; 

(4) Recognize Core-BB commands: Halt and Inhibit, Restart, 

and Enable, as out-of-range addresses; 

(5) Allocate the internal bus amongst several DMA modules; 

(6) Detect internal faults within the Core-BB; 

(7) Collect internal fault indications from all building 
blocks within the computer module; 

(8) Disable SCCM output (or set error message) under fault 
conditions ; 

(9) Provide reset/halt, or reset/rollback, capability for 
optional transient fault recovery; 

(10) Halt computation on recurring faults. 

A block diagram of the Core-BB is shown in Figure 4-21. 





Figure 4-21. Core-BB Block Diagram 






4.2.1.1 Core Building Block Connections. The following is a listing
of Core-BB connections and a brief description of their function:

(1) Internal Data Bus (DB) 18 lines (16 + 2 parity)

(2) Internal Address Bus (AB) 18 lines (16 + 2 parity) -

All building blocks and the Master CPU are connected
to these buses, which tie together the SCCM. The
Core-BB checks parity on outputs from other modules
and generates parity when the Master CPU outputs on
either bus.

(3) Local Data Bus (LDB) 16 lines - Special data bus to
the Check CPU. The Core-BB passes data directly from
the Internal Data Bus to the Local Data Bus when
inputs are required by both CPU's. When both CPU's
output, the Core-BB compares the two processors by
comparing the two data buses.

(4) Local Address Bus (LAB) 16 lines - Carries address 
outputs by the Check CPU which are compared with 
address outputs of the Master CPU (by comparing the 
Internal Address Bus with Local Address Bus). 

(5) R1-R5 (5) - Bus Request signals from DMA controllers
in other building blocks.

R1-R5 (complement) (5) - Complement of R1-R5.

These signals form morphic pairs (R1 and its complement, ..., R5
and its complement), which are sent from up to five SCCM building
blocks. They are checked for proper coding (i.e., being
complementary). The true values (R1-R5) and the complement
requests are processed by two redundant
circuits within the Core-BB, which in turn generate a
true and complementary set of acknowledge signals
(Ak1-Ak5 and their complements).


(6) Ak1-Ak5 (5) - Bus Grant (Acknowledge) signals to DMA
channels. Ak1-Ak5 (complement) (5) - Complement of Ak1-Ak5, forming
morphic pairs.

(7) HOLD, HOLD (complement) (2) - Bus Release request to Master and
Check CPU's.

(8) HOLDA, HOLDA (complement) (2) - Bus Release acknowledge from
Master and Check CPU's.

(9) IF1-IF8, IF1-IF8 (complement) (16) - Eight morphic Internal Fault
indicators from other building block circuits.

(10) RESET, RESET (complement) (2) - Morphic reset signals to all SCCM
modules from duplicated logic in the Core-BB.

(11) INHOUT, INHOUT (complement) (2) - Inhibit outputs to Bus
Controller and I/O BB's, from duplicated logic in Core-BB.

(12) RESTART, RESTART (complement) (2) - Morphic restart signals to
Master and Check CPU from duplicated logic in Core-BB.

(13) Clock In (1) - 1 MHz square wave clock in to
Core-BB.

(14) System Clock (1) - Clock to all circuitry in SCCM,
sent from and controlled by Core-BB.

(15) WRITE, NWRITE (2) - Memory read/write control level of
the SCCM Internal Bus.

(16) MSTART (1) - Memory Start Signal of SCCM Internal Bus.

(17) COMPL (1) - Completion level of SCCM Internal Bus.

Counting Vcc and ground, this circuit will require a 128 pin 

package. 


4.2.2 Core Building Block Implementation 

The Core Building Block consists of three sub-elements: a
Processor Check Element, a Bus Arbitration Element, and a Fault Handler
Element, which are described below.




4.2.2.1 The Processor Check Element (PCE). The Processor Check
Element serves three functions: (1) to compare the outputs of two
synchronous processors; (2) to encode and check internal bus parity;
and (3) to recognize and decode commands sent to the Core through the
internal SCCM bus.

The PCE is shown in Figure 4-22. It is connected to the two
18-bit internal address and data buses within the SCCM. The Master CPU
and other building blocks in the SCCM also reside on these buses. The
PCE provides a local address and data bus for a Check CPU, which is
operated synchronously with the Master CPU, and its outputs are compared
for checking.

Internal circuits in the PCE consist of: 

(a) Morphic Comparators MCMP - Each of these circuits compares
two pairs of 16-bit inputs, and generates a two-wire
output. The output takes on values 0,1 or 1,0 if
the 16-bit inputs agree, and it takes on values 1,1
or 0,0 if the inputs disagree or if an internal fault
occurs in the comparator circuit. These circuits are
said to be self-checking in that nearly all internal
faults will eventually result in an error indication.

One MCMP circuit compares the address output of the 
two processors. The second compares their outputs to 
the data bus. An isolation circuit is provided so 
that input data to the Master CPU can also be passed 
to the check CPU. 

(b) Morphic Parity Check/Generator - Two circuits are
provided to check and generate parity on the address and
data buses respectively. Coding on each bus consists
of two odd parity bits; one over all even bits and one
over all odd bits. Since the Master CPU generates
16-bit address and data outputs without parity, the
parity generators add the extra two parity bits to
their associated bus during CPU outputs. Other data
on the internal buses is expected to be coded (e.g.,
memory data) and the parity checkers check for proper
coding.

Each self-checking (morphic) parity check generates a
two-wire output with values 1,0 and 0,1, which represent
a correct check, and 1,1 and 0,0, which represent
either uncoded data on the associated bus or a fault
in the checker.

(c) Command Decoders. Two command decoders are provided
which have identical outputs. When an out-of-range
address appears on the address bus (with AB_0-AB_3 =
1111) and the Core-BB is addressed (AB_4-AB_7 = 0001),
the three least significant bits of the address bus
are decoded to generate six commands. These are
designated CMD1-CMD6 (*CMD1-*CMD6 from the duplicate
decoder). If any of these commands are received, the
level FORME is raised. The outputs of the command
decoder are compared in the Fault Handler to detect
faults in this circuitry.

Core-BB Commands are:

CMD1 - START Clock

CMD2 - STOP Clock

CMD3 - Initiate Rollback

CMD4 - Clear Faults, Enable Outputs

CMD5 - Output Error Status Word 1

CMD6 - Output Error Status Word 2

(d) Status Registers. Two status registers are used to
sample various fault indicators and make this information
available to external computer modules. When a
fault is detected by a fault synchronizer,
this data is sampled (i.e., clocked into the status
registers). Two Core-BB commands are reserved to read
out the status registers. When the level OUTS goes
low, tri-state drivers are enabled in the respective
status register and its data is output to the data bus.
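
The interleaved parity coding of item (b) can be illustrated with a short Python sketch. The function names and the choice of bit 0 as the least significant bit of the Python integer are assumptions made for the example; the report does not give this level of detail.

```python
def bus_parity(word16):
    """Return (p_even, p_odd), the two odd-parity bits for a 16-bit word.

    p_even covers bits 0, 2, ..., 14 and p_odd covers bits 1, 3, ..., 15,
    counting from the least significant bit (an assumed numbering).
    """
    even_ones = bin(word16 & 0x5555).count("1")
    odd_ones = bin(word16 & 0xAAAA).count("1")
    # Odd parity: each covered group plus its parity bit holds an odd number of 1's.
    return (even_ones + 1) % 2, (odd_ones + 1) % 2


def parity_ok(word16, p_even, p_odd):
    """Check an 18-bit bus value (16 data bits plus the two parity bits)."""
    return (p_even, p_odd) == bus_parity(word16)
```

Because adjacent bits fall into different parity groups, an error that flips two neighbouring lines disturbs both parity bits and is still detected, and an all-zero bus (18 lines at 0) is never a valid code word.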

Figure 4-23 shows a preliminary logic design of the circuits
which make up the Processor Check Element. Specific interface signals
are:


Inputs to PCE:

PO_a, PO_d       - Generate and output parity on address or
                   data bus, respectively.

OUTS1            - Output Status Register 1 to SCCM data bus.

OUTS2            - Output Status Register 2 to SCCM data bus.

F_1 + F_2        - Load status registers (a fault is detected).

PASS             - Connect Local Data Bus with Internal SCCM
                   data bus.

Outputs from PCE:

MPC_a, MPC_d     - Two-wire morphic parity check results for
                   address and data bus, respectively.

CMP_a, CMP_d     - Two-wire morphic comparison results for
                   address and data bus, respectively.

CMD1-6           - Command lines for decoded commands to the
                   Core-BB.

FORME            - Indicates Core-BB has been commanded by an
                   out-of-range address.

*CMD1-6, *FORME  - Duplicate of command decoder signals above.
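
The command decoding that produces CMD1-6 and FORME can be sketched in Python. Several details are assumptions made for the illustration and are not confirmed by the report: AB_0 is taken to be the most significant address bit, so the out-of-range field is the top four bits and the Core-BB select field the next four; the three-bit code n is taken to select CMDn; and the names `COMMANDS` and `decode_core_command` are invented.

```python
COMMANDS = {
    1: "START clock",
    2: "STOP clock",
    3: "Initiate rollback",
    4: "Clear faults, enable outputs",
    5: "Output error status word 1",
    6: "Output error status word 2",
}


def decode_core_command(address):
    """Return the command number 1..6 if the address commands the Core-BB, else None.

    A non-None result corresponds to the level FORME being raised.
    """
    if (address >> 12) & 0xF != 0xF:    # not an out-of-range address
        return None
    if (address >> 8) & 0xF != 0x1:     # another building block is addressed
        return None
    code = address & 0x7                # three least significant bits
    return code if code in COMMANDS else None
```

In the hardware two such decoders run side by side and their outputs are compared in the Fault Handler; in this sketch that would amount to calling the function twice and checking that the results agree.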


4.2.2.2 The Bus Arbiter Element. The bus arbiter accepts internal
bus requests (R and its complement) from up to five DMA channels in other
building block circuits. It accepts multiple requests on the basis of
priority, requests release of the internal bus by the two processors
(HOLD), and upon compliance by the processors (HOLDA) it grants access
to the selected DMA controller (Ak). Incoming bus requests are morphic
signal pairs which take on values of 0,1 when access is requested and
values 1,0 when no





























request is made. Values 1,1 and 0,0 represent fault conditions.

Similarly, the acknowledge signals are morphic, with 0,1 and 1,0
representing not-acknowledge and acknowledge (grant). As before, the values
0,0 and 1,1 represent fault conditions. One variable of each morphic
pair is associated with one priority resolver circuit and the other is
associated with the second resolver.
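
A behavioral sketch of the arbitration function, written in Python, may help fix the signal coding. It models a single resolver acting on both rails at once, whereas the design splits the rails between two duplicated resolvers. The priority order (channel 1 highest) and the function name `arbitrate` are assumptions for the illustration.

```python
REQUEST, NO_REQUEST = (0, 1), (1, 0)     # coding of the request pairs
GRANT, NO_GRANT = (1, 0), (0, 1)         # coding of the acknowledge pairs


def arbitrate(requests, holda):
    """Resolve up to five morphic request pairs.

    requests: list of two-wire pairs, index 0 assumed to have highest priority.
    holda:    True once the processors have released the internal bus.
    Returns (hold, acks, fault).
    """
    # A pair reading 0,0 or 1,1 is a coding fault, not a request.
    fault = any(pair not in (REQUEST, NO_REQUEST) for pair in requests)
    winner = next((i for i, pair in enumerate(requests) if pair == REQUEST), None)
    hold = winner is not None            # ask the CPU's to release the bus
    acks = [NO_GRANT] * len(requests)
    if hold and holda:
        acks[winner] = GRANT             # bus granted only after HOLDA
    return hold, acks, fault
```

For example, with requests pending on channels 2 and 4 the sketch raises HOLD immediately and grants channel 2 only once `holda` is true.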

The two priority resolvers are duplicated circuits, each of
which provides the bus arbitration function, one in "true" logic and
the other in "complement" logic. They are compared using morphic-and
circuits to detect faults in either unit by their disagreement. Each
priority resolver is a simple sequential circuit which accepts bus
request inputs, obtains release of the internal bus by the CPU, and
grants bus access to the requesting DMA channel with highest priority.

A functional block diagram of the Bus Arbiter Element is shown in
Figure 4-24, and a logic description of the Priority Resolver is
presented in Figure 4-25. The morphic-and circuits are shown in
Figure 4-26. Synchronization of the two priority resolvers is described
below:

(a) Timing. A square wave clock is sent to all building
blocks in the SCCM and is used for synchronization.

In the first half of the cycle, the clock is high and
this signal is designated φ1. The inverse of the
clock is φ2. Thus, the rise of φ1 is the beginning of
a clock cycle and the rise of φ2 is the middle.

All bus request, interrupt, and fault indicators are
constrained to change only at the beginning of φ1, and
it is assumed that they are generated by a D or J/K
flip-flop clocked with φ1. If these incoming signals
are examined instantaneously at the rise of either φ1
or φ2, they are assumed stable. Transmission delays
prevent change at the rise of φ1, and the circuits
have settled by the rise of φ2.


Figure 4-24. Bus Arbiter Layout (panel b: Detailed Block Diagram)

Figure 4-25. Priority Resolver Logic
























Figure 4-26. Morphic AND Circuits: (a) Self-Checking Exclusive-OR
Reduction Circuit; (b) Reduction Trees

Notes on Figure 4-26:

(a) Self-checking exclusive-OR reduction circuit, or two-input morphic
AND. An examination of this circuit will show that for every "stuck"
variable, there exists at least one coded input which will result in
a fault indication at the output (00 or 11). This is the
self-checking property. A tree of these circuits can be used to
compare a large number of signals.

(b) Reduction trees (5-input morphic AND and 8-input morphic AND).
Multiple-input morphic-AND (MAND) circuits consist of trees of the
circuit in (a) above and reduce multiple morphic pairs to a single
two-wire pair taking on 01 or 10 if all input pairs are complementary,
and 11 or 00 if all inputs are not complementary or a circuit error
is uncovered.
















Incoming bus request pairs and outgoing acknowledge
pairs are reduced with 5-pair morphic AND circuits to
generate two 2-wire morphic check signals. MOIN
verifies that the inputs are correct and MOO checks
that the duplex arbiters agree. These check signals
are examined precisely on the rise of φ2 in the Fault
Handling Logic.

(b) Implementation. Two priority resolver implementations
are given; one uses a PLA and the other is built
around a priority encoder chip (74278). The PLA is
better for VLSI layout, and the other approach is
easier for breadboarding (see Figure 4-25).

The resolver implementation is straightforward and the
logic is largely self-explanatory. One additional
feature is an added flip-flop which has a subtle but
important purpose. When the system is RESET upon
error, the CPU will not necessarily release itself
from the bus. Thus we force isolation of the processor
with tri-state transceivers and must also generate a
hold acknowledge (HOLDA) signal. Upon detecting a
permanent fault, a latch is set in the resolver which
generates a continuous HOLDA signal. It is only
released upon a command to restart the CPUs in a
program rollback (RESTART).

4.2.2.3 Combining Fault Indicators and Other Synchronized Morphic
Check Signals. There can be up to eight morphic internal fault
indicators from external building blocks. These signal pairs make
transitions between values 0,1 and 1,0 if their associated building block is
working properly. Values 0,0 or 1,1 indicate an internal fault.

These signals are reduced by an 8-pair morphic-and circuit
to produce a single morphic internal fault indicator MIF. Since the
internal fault and bus arbiter check signals are all synchronized with
the clock, they can be combined into a single 2-wire morphic fault
indicator. Thus, MIF, MOIN, and MOO are combined with a 4-input
morphic-and circuit to produce a single CSMF fault indicator, which is
a combination of all Clock-Synchronized Morphic Fault indicators. (An
additional synchronous morphic indicator MRS is included, which is the
result of comparing the outputs of two duplex Recovery Sequencers in the
Fault Handling Logic.)
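
The reduction of several morphic pairs to one can be sketched in Python with the standard two-rail checker cell. This is an illustration under stated assumptions: the gate-level circuit of Figure 4-26 is not legible in the source, so the cell equations below are the textbook two-rail checker rather than a transcription of the figure, and the cells are chained rather than arranged as a balanced tree. The names `morphic_and`, `reduce_pairs` and `is_valid` are invented.

```python
from functools import reduce


def morphic_and(a, b):
    """Two-rail checker cell: the output pair is complementary only if
    both input pairs are complementary."""
    a0, a1 = a
    b0, b1 = b
    return (a0 & b0) | (a1 & b1), (a0 & b1) | (a1 & b0)


def reduce_pairs(pairs):
    """Reduce any number of morphic pairs to a single two-wire indicator."""
    return reduce(morphic_and, pairs)


def is_valid(pair):
    """A morphic pair is valid when its two wires are complementary."""
    return pair[0] != pair[1]
```

For instance, `reduce_pairs([(0, 1)] * 8)` yields a complementary pair, while replacing any one input by (1, 1) or (0, 0) forces the output to 1,1 or 0,0.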

4.2.2.4 The Fault Handler Element. The Fault Handler Element is
responsible for overall fault detection in the SCCM and is also capable
of taking limited recovery action. It consists of two major parts:
duplex Fault Synchronizers, and duplex Recovery Sequencers. Both parts
are duplicated and compared to provide fault detection, as shown in
Figure 4-27.

Each Fault Synchronizer examines morphic fault indicators 
and check signals from the other building blocks and from within the 
Core-BB itself. Its primary function is to examine these signals only 
when they are stable and valid to detect faults, and to deliver a Master 
Fault Indicator to the Recovery Sequencer pair. 

The Recovery Sequencer (upon receiving a Master Fault Indi- 
cator), disables outputs from the SCCM and resets the CPU's. Optionally, 
a restart can be attempted, and if successful, the software can re-enable 
outputs and clear the fault indications in the Fault Handler. 

Either of the Fault Synchronizer-Recovery Sequencer pairs 
can disable outputs from the SCCM. Also Recovery Sequencer outputs are 
compared and a disagreement is signalled to both Fault Synchronizers. 

A logic diagram for a Fault Synchronizer and Recovery 
Sequencer is given in Figure 4-28 and is described below: 

(a) The Fault Synchronizer. This circuit examines the
various morphic fault indicators (CMP, MPC, CSMF) at
times when (1) their checks are relevant, and (2)
the morphic signals are stable (not changing). The
clock-synchronized morphic fault indicator CSMF is
examined at the rise of every φ2. The comparison and
parity check pairs are examined when a bus completion
signal (COMPL) is observed. The five morphic check

Block-Interconnection Diagram


signals (CMP_a, CMP_d, MPC_a, MPC_d, CSMF) are each input to a
not exclusive-or function which yields a logic 1 output
whenever these signals indicate a fault by taking
on values 11 or 00. These signals are "anded" with a
set of signals which indicate the times at which each
specific check is relevant, and the result is fed to
a flip-flop which is clocked at a time when the results
are stable. Conditions for examining the morphic check
signals are given below in Table 4-3.


Table 4-3. Conditions for Examining Morphic Check Signals

Signals   Function                       Enabling Function     Strobe F.F.

MPC_a     Address bus parity check       HOLDA - processor     COMPL - bus
                                         off the bus           completion, φ2

MPC_d     Data bus parity check          HOLDA + READ          COMPL

CMP_a     Compare Check CPU and          HOLDA                 COMPL
          Master CPU outputs to
          address bus

CMP_d     Compare CPU's outputs to       HOLDA·WRITE           COMPL
          data bus

CSMF      All clock-synchronized         at all times          φ2
          morphic fault indicators

(Complement bars on the enabling functions are illegible in the source.)



A flip-flop is associated with each fault indicator
which is set if a fault is observed. An additional
fault flip-flop (f, subscript illegible) is included, which is set if
(1) the bus signals MSTART and COMPL occur in improper
order, (2) too much time elapses between memory complete
signals (COMPL), or (3) a program rollback is
externally commanded (CPSBT). All six fault flip-flops
are synchronized with [clock phase illegible] and combined to
produce a Master Fault Indicator F which can only be set
at the rise of [clock phase illegible].

(b) The Recovery Sequencer (RS). Upon detecting a fault
(F), the RS (1) inhibits outputs, (2) generates a four
clock pulse reset signal, and (3) for the first fault,
commands a program rollback/restart (see the F_a, F_b
sequence in Figure 4-28).

When a "first" fault occurs, the RS inhibits outputs
and issues a five pulse sequence. For four clock
periods, a RESET signal is generated, followed by a
RESTART to the CPUs to attempt a program rollback.

For subsequent faults, the outputs remain inhibited
and a reset is generated, but no additional RESTART
is generated. (This effectively halts the CPU upon a
second fault committed while trying to roll back.)

If the rollback is successful, a program command can
be issued (CMD4) which clears all fault latches
(CLEARSEQ), re-enables outputs, and thus provides
complete absolution for the remission of faults.

(c) Control Signal Generation. One of the two Fault
Handler Elements contains a small circuit for control
signal generation. Internal control signals are
generated in the following fashion. (Complement bars are
illegible in the source.)

PASS (Pass data to Check CPU) = HOLDA + READ
    (one term is annotated "don't care" in the source)

PO_A (Generate address parity) = HOLDA

PO_D (Generate data parity) = HOLDA·WRITE + READ·FORME·MSTART

OUTS_1 = READ·FORME·MSTART·CMD5

OUTS_2 = READ·FORME·MSTART·CMD6

COMPL (generate completion) = FORME·MSTART (delayed)

(d) Manual and External Module Control - This small circuit
provides for clearing of fault latches in the Fault
Handler and for initiating program restart. These can
be carried out by front panel switches or under program
control through out-of-range commands. Also included
is a master reset switch and a facility for single-stepping
the SCCM clock for test and debugging. The
logic diagram is shown in Figure 4-29.
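
The behavior of the Recovery Sequencer described in item (b) can be summarized in a small Python model. It is a behavioral sketch only, not the logic of Figure 4-28: the class and method names are invented, and the assumption is made that RESTART is issued on the clock immediately after the four RESET clocks.

```python
class RecoverySequencer:
    """Behavioral model: inhibit outputs, reset for four clocks, and
    attempt a restart only for the first fault."""

    RESET_CLOCKS = 4

    def __init__(self):
        self.inhibit = False          # outputs of the SCCM disabled
        self.fault_seen = False       # a fault has occurred since the last clear
        self.reset_left = 0           # remaining RESET clock periods
        self.restart_pending = False  # a RESTART still has to be issued

    def fault(self):
        """The Master Fault Indicator F has been raised."""
        self.inhibit = True
        self.reset_left = self.RESET_CLOCKS
        self.restart_pending = not self.fault_seen  # first fault only
        self.fault_seen = True

    def clear(self):
        """CMD4: clear the fault latches and re-enable outputs."""
        self.inhibit = False
        self.fault_seen = False

    def clock(self):
        """Advance one clock period; return (inhibit, reset, restart)."""
        reset = restart = False
        if self.reset_left:
            self.reset_left -= 1
            reset = True
        elif self.restart_pending:
            self.restart_pending = False
            restart = True
        return self.inhibit, reset, restart
```

A first call to `fault()` followed by five calls to `clock()` gives four RESET periods and one RESTART; a second `fault()` before `clear()` gives the four RESET periods with no RESTART, which corresponds to halting the module.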

4.3 THE BUS INTERFACE BUILDING BLOCK (BIBB) 

The Bus Interface Building Block provides the mechanism by 
which information is transferred between computer modules via the inter- 
communications bus system. The BIBB can be programmed as a Bus Adaptor 
or as a Bus Controller, as previously described in Section 3.5.4. The 
following sections provide a more detailed description of the require- 
ments, functions, and implementation of this building block. 

4.3.1 Bus System Requirements 

The choice of a bus system for the fault-tolerant building-
block computers requires careful consideration of functional character-
istics so as to meet a wide range of applications; that is, the bus
must be useful as well as fault-tolerant. Therefore, the following
general characteristics have been provided in the bus system.

(1) Formats. The Building Block Bus System (BBBS) 

utilizes 1553A formats to maximize compatibility with 
planned and existing equipments. This also defines 
speed and electrical characteristics. The 1553 for- 
mat contains status messages required for fault- 
tolerant implementation. 



(2) Memory-to-Memory Transmission. The bus system is
capable of moving data blocks directly between
memories of the connected computers, using cycle-
stealing techniques to minimize software support
requirements.

(3) Indirect Addressing. Within each SCCM of a network,
various areas of memory are reserved for incoming or
outgoing information blocks. These data blocks can be
reached through the bus system by absolute memory
address or by "name" through indirect addressing.

In the first case, a typical bus command is "Move a
5-word block from source SCCM 5 location 200, to
acceptor SCCM 3 location 3000." In the second case,
the bus command would be, "Move 5 words from source
SCCM 3, pointer 1 to acceptor SCCM 3, pointer 2."

In the indirect addressing case, the computers main-
tain a pointer table within their own memory which
contains the addresses of the relevant data (and which
is referenced by the BIBB). In our example the first
pointer table entry in module 3 would contain the
address 200 and the second pointer in module 3 would
contain 3000.

Indirect addressing is important because it allows
decoupling of the specification of global data blocks
from the detailed assembly listings in the host SCCMs.
Thus, software can be changed in one computer without
affecting the data references in the other machines.

(4) Multiple Acceptors . The data bus is capable of trans- 
mitting information blocks from the memory of any 
source SCCM to the memories of one or more acceptor 
SCCMs. Since multiple acceptors are not directly 
provided in the 1553 format, additional modules must 
be commanded to "listen in" on a 1553 terminal-to-
terminal transmission. This preserves the 1553A
format while allowing a "broadcast" mode for
distributing time and engineering measurements of general
interest.

(5) Block Length. The maximum length of memory blocks
transmitted between computers should be at least
several hundred words in order to transfer files of
collected data (for a number of information collection
systems). This is implemented by allowing the concat-
enation of 32-word transfers (the maximum number
allowed in the 1553A format). A long block transfer
is implemented as a sequence of 32-word transmissions
followed by a final block of fewer than 32 words.
This chopping up of long blocks into
32-word segments is carried out by the bus system in
order to preserve 1553 compatibility.

(6) Universal Hardware Interface. The bus system inter-
face with the host processors should be sufficiently
general to be applied to any of a large number of dif-
ferent host CPUs which may be employed. The most
standard interface that we could find is memory-mapped
I/O. The BIBB communicates with the SCCM through the
18-bit internal address and data buses, using direct
memory access (DMA). Control of the bus system by the
host CPU occurs using out-of-range addresses (memory-
mapped I/O) as commands.

(7) The Bus Controller. The Bus Controller performs the 
Bus Control functions associated with the 1553A for- 
mat, along with the augmentations described above. 

The Bus Controller is given a pointer to a bus control
table in the host SCCM's memory by an out-of-range
store instruction. The Bus Controller extracts the
control table from the memory of the host module,
interprets the bus table, issues bus commands to effect
the requested block transfer, carries out data block
transactions involving its own memory, and monitors
status signals. The host SCCM is notified of correct
or erroneous transmissions through a status message
left in memory and receipt of an interrupt upon
completion of the transmission.

(8) The Bus Adaptor. Each Bus Adaptor moves data into and
out of its associated SCCM's memory as requested by the
controller of its associated bus.

(9) Requirements for Fault Tolerance. General requirements
of the bus system to ensure fault tolerance are:

(a) Protection against "party line damage" from bus
shorts or a bus interface talking out of turn.

(b) Detection of errors in transmission and (i) noti-
fication of the SCCM by the Bus Controller,
through status messages, and (ii) providing a
mechanism to allow the acceptor module to deter-
mine that it has received an incomplete or
erroneous message.

(c) Detection of internal faults in the Bus Control-
ler and notification of its host SCCM. The
Core-BB disables the SCCM under this condition.

(d) Detection of internal faults in a Bus Adaptor 
and disabling its subsequent function. (This 
does not disable the SCCM since other redundant 
Bus Adaptors may still be functioning.) 

(e) The use of redundant buses and host computers 
so that messages can be rerouted in case of bus 
failures, and computations can be relocated in 
case of computer failures. 
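
The indirect addressing scheme of requirement (3) can be illustrated with a small model. This is only a sketch under stated assumptions: each SCCM memory is modelled as a Python list of words, and the pointer table base address (POINTER_TABLE_BASE) and all function names are hypothetical, not taken from the building-block design.

```python
# Hypothetical model: each SCCM memory is a flat list of 16-bit words.
MEMORY_WORDS = 4096
POINTER_TABLE_BASE = 16   # assumed fixed location of the pointer table


def resolve_pointer(memory, pointer_number):
    """Return the data address held in pointer table entry `pointer_number` (1-based)."""
    return memory[POINTER_TABLE_BASE + pointer_number - 1]


def move_block(src_mem, src_addr, dst_mem, dst_addr, word_count):
    """Copy `word_count` words between two module memories (absolute addressing)."""
    dst_mem[dst_addr:dst_addr + word_count] = src_mem[src_addr:src_addr + word_count]


def move_block_indirect(src_mem, src_ptr, dst_mem, dst_ptr, word_count):
    """Same move, but both addresses are looked up by pointer 'name'."""
    move_block(src_mem, resolve_pointer(src_mem, src_ptr),
               dst_mem, resolve_pointer(dst_mem, dst_ptr), word_count)


source = [0] * MEMORY_WORDS
acceptor = [0] * MEMORY_WORDS
source[200:205] = [11, 22, 33, 44, 55]      # the 5-word block to send
source[POINTER_TABLE_BASE + 0] = 200        # pointer 1 in the source module
acceptor[POINTER_TABLE_BASE + 1] = 3000     # pointer 2 in the acceptor module

move_block_indirect(source, 1, acceptor, 2, 5)
```

If the source program later relocates its block, only its own pointer table entry changes; the bus command ("pointer 1 to pointer 2") stays the same, which is the decoupling the text describes.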


4.3.2 Bus Controller Functions

The Bus Controller (BC) is activated by a store instruction
to one of a set of out-of-range addresses. It uses the value of the
word being stored as a pointer to a Bus Control Table in the host memory.
The BC reads the Bus Control Table from memory by cycle stealing and
carries out the requested transfer. The BC issues those 1553A commands
necessary to execute the requested data transfer over an external bus,
and monitors the associated status words to verify that the transfer was
properly completed. Two additional out-of-range references can be used
to reset the BC or read out status. The specific memory-mapped commands
to the BC are shown in Table 4-4.

Table 4-4. Memory-Mapped BC Commands

Command               R/W - Address (AB0-AB15)    Comments

(1) Execute Bus       Write to:                   AB12 = Odd Parity over (AB13-AB15)
    Control Table     1111 0010 ddd0 ZZZZ         (AB13-AB15) Specifies which
                                                  external bus to use for
                                                  transmission
                                                  DB contains address of Bus
                                                  Control Table

(2) Read BC           Read from:                  DB ← Status Register (value of
    Internal Status   1111 0010 ddd1 0001         internal flip-flops)

(3) Reset BC          Write to:                   DB ignored - BC is reset
                      1111 0010 ddd1 0010

NOTES: AB0-AB3 = 0000 -- Out of Range Address
       AB4-AB7 = 0010 -- Identifies BC
       AB11-AB15 -- Specifies BC command
       d -- don't care
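
As an illustration of how a host program might form and recognize these addresses, the following sketch builds the three command addresses from the bit patterns printed in the table rows. It assumes AB0 is the most significant address bit, sets the don't-care bits to zero, and reads "odd parity over (AB13-AB15)" as meaning that AB12 is chosen so that AB12-AB15 together contain an odd number of ones. The function names are hypothetical.

```python
# 1111 0010 0000 0000: out-of-range prefix plus BC identifier,
# taken from the address patterns printed in the table rows.
BC_BASE = 0xF200


def odd_parity_bit(value):
    """Bit that makes the total number of ones (value plus this bit) odd."""
    return 0 if bin(value).count("1") % 2 == 1 else 1


def execute_table_address(bus):
    """Address for command (1): AB11 = 0, AB12 = parity, AB13-AB15 = bus number."""
    if not 0 <= bus <= 7:
        raise ValueError("bus number must fit in three bits")
    return BC_BASE | (odd_parity_bit(bus) << 3) | bus


READ_STATUS_ADDRESS = BC_BASE | 0b1_0001   # ddd1 0001 with d = 0
RESET_ADDRESS = BC_BASE | 0b1_0010         # ddd1 0010 with d = 0


def classify(address, is_write):
    """Classify a host memory reference as one of the three BC commands, or None."""
    if address & 0xFF00 != BC_BASE:
        return None
    low5 = address & 0x1F                  # AB11-AB15; AB8-AB10 are don't care
    if low5 & 0x10 == 0:                   # AB11 = 0: execute Bus Control Table
        bus = low5 & 0x7
        parity_ok = ((low5 >> 3) & 1) == odd_parity_bit(bus)
        return ("execute", bus) if is_write and parity_ok else None
    if low5 == 0b1_0001 and not is_write:
        return ("read_status", None)
    if low5 == 0b1_0010 and is_write:
        return ("reset", None)
    return None
```

A store of the Bus Control Table address to execute_table_address(3) would, under these assumptions, start a transfer on external bus 3.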


4.3.2.1 The Bus Control Table (BCT). Bus Control Tables are three
or four words long and have one of two formats which are decodable from
the first word in the table, as shown in Table 4-5.

Table 4-5. Bus Control Table Formats

Controller/Terminal Transmission

Word 1   = 0        Do this table and stop
         = 1        Execute next table after completing this table
Word 2   Data Address for block in BC host SCCM's memory
Word 3   A 1553A transmit or receive command

Terminal/Terminal Transmission

Word 1   = -32768   Do this table and stop (1000 ... 00)
         = -1       Execute next table after this one (111 ... 11)
Word 2   Data Address for block in BC host SCCM's memory
         (to "listen in")
Word 3   A 1553A Receive Command
Word 4   A 1553A Transmit Command


Word 1 of the control table specifies a Controller/Terminal
or Terminal/Controller transmission if its most significant bit is zero,
and a Terminal/Terminal transmission if its MSB is one. For a sequence
of short transmissions, it is useful to place their control tables in
consecutive memory locations and direct the BC to execute them all auto-
matically. This option is provided in the following fashion: If the
least significant bit of word 1 equals one, the BC will automatically
execute the next Bus Control Table after successfully executing the
current one.
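
The two flag bits of word 1 can be decoded as in the following sketch. The function name and the returned tuple layout are illustrative only; the table length (three or four words) follows from the two formats of Table 4-5.

```python
def decode_word1(word1):
    """Decode the first Bus Control Table word (treated as a 16-bit value).

    Returns (terminal_to_terminal, chain_next, table_length_in_words).
    """
    word1 &= 0xFFFF                               # accept signed or unsigned input
    terminal_to_terminal = bool(word1 & 0x8000)   # MSB selects the table format
    chain_next = bool(word1 & 0x0001)             # LSB requests the next table
    table_length = 4 if terminal_to_terminal else 3
    return terminal_to_terminal, chain_next, table_length
```

With chained tables laid out consecutively, the next table would begin table_length words after the current one.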


The first word of a BCT is inspected to determine which of
the two formats is employed; the remaining words are interpreted in the
following fashion:

(1) Transmissions Between Controller and a Terminal

The second word specifies the address within the
controller's host memory where information is to be
extracted or stored. If this address is positive
(i.e. less than 32768), it is treated as an absolute
address. If it is negative, it is complemented by
the BC and used as an indirect address, i.e. the
specified location is used as a pointer to the speci-
fied data. The third word is a 1553A command to be
issued to the participating terminal on the bus, to
or from which information is to be transferred.

(2) Transmission Between Terminals 

For a terminal/terminal transmission the second bus
table word specifies an address to store data in memory
of the BC's host SCCM. The word is interpreted as in
(1) above. The BC "listens in" on the transmission
between terminals and stores the information in its
local memory where it may or may not be used by its
host processor. The third and fourth words are the
1553A receive and transmit commands necessary to set
up the specified communication.
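
The interpretation of the second table word can be sketched as follows. The text says only that a negative address is "complemented"; this sketch assumes a bitwise (ones') complement, which is an assumption and not confirmed by the report. The memory is modelled as a hypothetical mapping from address to word.

```python
def resolve_data_address(word2, memory):
    """Return the data address selected by Bus Control Table word 2."""
    word2 &= 0xFFFF
    if word2 < 0x8000:                       # positive: absolute address
        return word2
    # Negative: assumed bitwise complement gives the pointer location,
    # and the word stored there is the address of the data.
    pointer_location = ~word2 & 0xFFFF
    return memory[pointer_location]
```

For example, with a pointer at location 0x0040 holding 0x0200, the table word 0xFFBF would select data address 0x0200 under this assumption.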

4.3.2.2 Status on Completion or Termination. Upon completion or
error termination of a communication, the BC writes a Completion Status
Word (CSW) into a fixed location in memory and generates an interrupt.

The CSW specifies one of five conditions:

(1) Communication OK (COM OK)

(2) Communication complete but terminal's host SCCM has
shut down (MDOWN)

(3) Requested bus not available (BNA) 


(4) Error in transmission -- improper coding detected or
status message not returned (COMERR)

(5) Improper activity on requested bus (BACT)

In the locations immediately following the CSW, the BC
stores the address of the Bus Control Table which was executed, and any
(one or two) 1553A status messages that were received. Thus up to four
words of status are:

N      CSW
N+1    Bus Control Table Address
N+2    1553A Status Word*
N+3    Second 1553A Status Word* (Terminal/terminal
       transmission)

*Only stored if received properly

4.3.2.3 Redundant Bus Utilization. The BC can be connected to
several Intercommunication Buses. Its access is granted on the basis of
a priority assignment established by "daisy chain" connections for each
bus. The bus access control hardware is implemented in the driver/
receiver logic external to the BC. The BC passes on the bus specifica-
tion in the memory-mapped command (AB12-AB15) that caused its activation.
The interface electronics either connects the BC with that bus or, if it
is not available, returns a busy indicator (BBUSY). The bus request is
latched so that, if the bus is granted, the BC maintains control over
the bus subsequent to the initial transmission. (Buses can be released
by specifying a transmission over bus "zero", which is nonexistent.)

4.3.3 Bus Adaptor Functions 

The Bus Adaptor responds to 1553 transmit and receive com-
mands directed toward its host module. It accepts or delivers the number
of words specified in the Word Count field of the bus command. The func-
tions performed internal to the bus adaptor are determined by the 5-bit
sub-address/mode (S/M) field of the associated command. These functions
fall into two categories: transfer functions and set-up functions.


4.3.3.1 Transfer Functions. Twenty-eight S/M values are interpreted
as Indirect Transfer instructions, and one S/M value is reserved for the
Continue function, as described below:

(1) Indirect Transfer - The S/M field specifies one of
28 pointers maintained at fixed locations within the
host computer memory. When a transmit or receive
command is received, the bus adaptor accesses the
appropriate pointer to determine the starting address
for the incoming or outgoing data. By modifying
pointers the host computer programs can change the
physical locations accessed through each pointer.

Sequential data words are accessed for output to the
bus, or input to the host memory from the bus, using
DMA cycle-stealing techniques.

Several bus adaptors may be moving data out of or into
the host memory in a time-multiplexed fashion so long
as none is forced to wait beyond 20 μsec (the word
period at the maximum word rate of the 1553A bus).

(2) Continue - The continue function is specified by one
value of the S/M field (00011) and is used for trans-
mission of messages longer than 32 words as well as
for direct addressing. The continue function in a bus
command indicates that the transfer should
continue from where the last transfer left off in the
host computer memory. Thus a long message can be
broken into a series of shorter transmissions from and
into concatenated memory locations.

Direct memory addressing is achieved by loading an
internal address register in the bus adaptor with a
special setup instruction (described below). A
"Continue" transfer then moves data into or out of
locations beginning at the specified physical address.
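
The way a long block might be broken into 32-word transmissions using the Continue function can be sketched as below. This is an illustrative plan generator only: the function name is hypothetical, and the choice of which S/M value names the starting pointer is left to the caller.

```python
CONTINUE_SM = 0b00011   # S/M value of the Continue function, from the text
MAX_WORDS = 32          # maximum word count of one 1553A transmission


def plan_transfers(first_sm, total_words):
    """Return a list of (S/M value, word count) pairs covering `total_words`.

    The first transmission uses `first_sm` (for example an indirect pointer);
    every following transmission uses Continue so that the adaptor carries on
    from where the previous transfer left off.
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    plan = []
    remaining = total_words
    sm = first_sm
    while remaining > 0:
        count = min(MAX_WORDS, remaining)
        plan.append((sm, count))
        remaining -= count
        sm = CONTINUE_SM
    return plan
```

A 70-word block started through a pointer would thus be planned as two 32-word transmissions followed by a 6-word transmission, the last two using Continue.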


4.3.3.2 Setup Instructions. Three special Setup instructions are
specified by individual S/M field values. They are (1) Direct Command,
(2) Direct Address, and (3) Silent Acceptor. These are all "receive"
commands with a one-word data transmission (WC = 1) which contains the
parameters of the specified function.

(1) Direct Command (00000) - The data word sent to the
adaptor (terminal) is decoded as a discrete command.
If there is no error, the least significant 8 bits of
the received word are output (DC) and a strobe is gener-
ated by the bus adaptor. Direct commands are used to
generate interrupts, to effect power switching within
the host module, and for other direct control functions as
required.

(2) Direct Address (00010) - The data word sent to the
adaptor is loaded into an internal address register
and is used as a physical address at which a subse-
quent transfer can enter data into or extract data from the host
memory. This setup instruction is followed by a
"Continue" transfer command to move data into or out
of specified locations.

(3) Silent Acceptor (00001) - The data word sent to the
adaptor specifies a "soft" name. If a subsequent
receive command is sent to a module with the same
identification as the temporary soft name, the adaptor
"listens in" on the transmission and stores the trans-
mitted data in its own SCCM's memory. It, in effect,
becomes a covert acceptor, and does not generate a
status message.

The silent acceptor mode is cancelled by any subsequent
Direct Command, Direct Address, or Silent Acceptor
command to the module. A silent acceptor module does
not return status messages, since this is done by the
module which is overtly addressed.
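
The division of the 5-bit S/M field among the functions described in Sections 4.3.3.1 and 4.3.3.2 can be summarized in a short sketch. The function name and the returned strings are illustrative; the four reserved codes are those given in the text, and the remaining values are taken to be the indirect transfer pointers.

```python
SPECIAL_SM = {
    0b00000: "direct command",
    0b00001: "silent acceptor",
    0b00010: "direct address",
    0b00011: "continue",
}


def classify_sm(sm):
    """Return the bus adaptor function selected by a 5-bit S/M field value."""
    if not 0 <= sm <= 31:
        raise ValueError("S/M field is five bits wide")
    return SPECIAL_SM.get(sm, "indirect transfer")
```

Four of the 32 codes are reserved, which leaves the 28 indirect transfer pointers mentioned in the text.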


4.3.4 BIBB Implementation

The BIBB consists of five subelements: (1) the Mill,
(2) the External Bus Interface (EBI), (3) the Internal Bus Interface
(IBI), (4) the Controller (CONT), and (5) the Fault Handler (FH), as
shown in Figure 4-30.

The BIBB is centered around the Mill, a small processor
which includes ROM, RAM, internal registers, and an ALU. Data words in
transit between the external bus and the SCCM are buffered in the Mill.
It is also responsible for generating addresses for DMA, word counting,
[illegible] control words, and other processing functions required of the
BIBB.

The EBI provides the interface between the Mill and the
external bus. It accepts parallel command and data words from the Mill
and encodes them for serial transmission over the bus. It also samples
incoming Manchester-coded data words, performs serial-to-parallel
conversion, makes these words available to the Mill and signals the
Controller of their arrival.

The IBI provides a DMA interface through which information
can be transferred between the Mill and the SCCM's memory. It contains
data and address registers for buffering incoming and outgoing data and
DMA request and acknowledge control logic. The IBI also contains a
command decoder, used to recognize and decode memory-mapped commands to
the BIBB (from the host SCCM).

The Controller generates control signals for the other
subelements as a function of commands received from the external or
internal (SCCM) bus and conditions sampled within the BIBB. It is
microprogrammed using both a ROM and a PLA.

The various circuits within the BIBB either use error-
detecting codes or are duplicated and compared with self-checking
checkers to provide fault detection. The Fault Handler combines these
fault signals into a single morphic Internal Fault Indicator (IF).
Upon detection of an internal fault, the FH terminates any ongoing
transmission.



Figure 4-30. Simplified BIBB Block Diagram 



In order to explain the workings of the BIBB we first
examine the external and internal interface logic (EBI and IBI). These
circuits supply data and commands and largely define the environment of
the Mill and the Controller. The latter two subelements are then
explained as a fairly conventional processor.

4.3.4.1 The External Bus Interface. The external bus interface has
two operating modes. In the input mode it decodes words appearing on
the 1553A external bus, and converts these incoming serial words to
parallel NRZ form. It alerts the Controller when a 1553A Command Word
or Data Word has arrived and is available for transfer to the Mill over
the BIBB internal bus (BIBIB). A one-word buffer (CDR) holds an incoming
command or data word while the next word may be arriving over the bus.
This allows a period of 20 μsec for a word to be moved to the Mill before
it is overwritten by a subsequent word arriving over the external bus.
A newly arrived word in the CDR may be output to the BIBIB in three ways.
The sixteen-bit word may be moved directly, or if the word is a command,
the word count or S/M fields can be right justified and individually
moved to the Mill.

In the output mode, words are transferred from the Mill to
the EBI. Each word is designated as a command or data. A command sync
or data sync is appended and the word is converted to serial biphase
Manchester and output to the external bus. A one-word buffer is pro-
vided in the EBI so that a new output word can be moved from the Mill
to the EBI while the current word is being (serially) output. This
allows up to 20 μsec to elapse between loading data words for output
(before the message is interrupted for lack of data). The Controller
is notified when the EBI is capable of accepting a new word, and output
terminates when no words arrive from the Mill to continue the
transmission.

The External Bus Interface block diagram is shown in
Figure 4-31, and consists of a Manchester/NRZ Translator (MNT), and
Buffer and Control Logic (BAC). The EBI is fully duplexed, i.e. there
are two complete EBI circuits (A and B) whose outputs are compared
to detect internal faults. A transfer register (XFR) provides serial/
parallel conversion, and the Command Data Register (CDR) serves as a
single word buffer through which incoming and outgoing words are passed.

Coded control inputs (CCI) listed in Figure 4-31:

Code   CCI    Name      Function
0      1000   NOP
1      0001   INWRD     BIBIB ← CDR
2      0010   INWC      BIBIB (11-15) ← CDR (11-15)
3      1011   INSM      BIBIB (11-15) ← CDR (6-10)
4      0100   OUTCMD    CDR ← BIBIB, output command to external bus*
5      1101   OUTDATA   CDR ← BIBIB, output data to external bus*

*As soon as the current contents of the XFR register are sent out, data in
the CDR is transferred to XFR and output with a preceding command sync or
data sync. The output mode is established by receipt of OUTCMD or OUTDATA.
It is cleared when no further words are sent for output.

R1   R2   Mode
0    0    X        NO RVPT
0    1    INPUT    Data word received in CDR from external bus
1    0    INPUT    Command received in CDR from external bus
1    0    OUTPUT   CDR is clear to accept next word for output

EBMIF - internal error detected in EBM

Figure 4-31. External Bus Manager Block Diagram

Both copies of the EBM receive data from the external bus
(INBUSHI, INBUSLO) and the BIBB internal bus (BIBIB). However, only
copy A outputs data over these buses. Copy B contains morphic
comparators and compares the values being output with values it is generating
to detect faults.

The BIBIB is bi-directional (3-state) and consists of 18
lines. Sixteen are for data (BIBIB 0 - BIBIB 15) and two represent parity,
using the same code as is employed in the SCCM internal bus. That is:

BIBIB 16 = parity over (BIBIB 0, 2, 4, ... 14)

BIBIB 17 = parity over (BIBIB 1, 3, 5, 7, ... 15)

Both the A and B copies of the EBI generate two control 
levels (R1 , R2) to notify the Controller of its state. Assignments of 
R1 , R2 are shown in Figure 4-31. In the input mode, they indicate that 
a command or data is available in the CDR register. In the output mode, 
they indicate that the CDR is free to accept new data. 

Coded control inputs (CCI) to the EBI are also shown in
Figure 4-31. In the input mode, they allow outputting the contents of
the CDR (or only the S/M or WC fields) to the BIBIB. In the output mode
they are used to load command or data words from the BIBIB into the CDR
for subsequent transmission over the external bus.

The output mode is established by executing an OUTCMD or
OUTDATA command. The EBI remains in the output mode until no new words
are loaded into the CDR for output. It then returns to the input mode.

The following paragraphs describe a preliminary logic design
of the EBM.


4.3.4.1.1 The Manchester/NRZ Translator (MNT). The MNT synchronizes
with incoming bus data and delivers serial NRZ data. It also detects
and signals data sync and command sync headers of the 1553A messages.
The circuitry, shown in Figure 4-32, has the following inputs and outputs:

INPUTS:   8 MHz clock
          INBUSHI, INBUSLO     detect high and low levels of the 1553A bus
          RESET                sets MNT to HALT state (S0)
          OUTO                 (NOT) output mode

OUTPUTS:  DATA IN (true and complement), DATA CLOCK
                               serial bus data and 1 MHz clock
                               synchronized to bus data
          DATA SYNC            data sync being received
          COMMAND SYNC         command sync being received
          S5                   not in State 4


A Transition and Zero Detector samples the external bus at
an 8 MHz rate. If the bus has value zero during any two samples (i.e.
for 125 nsec), it is assumed to be quiescent (BZRO). If the bus changes
value between any two samples, a transition is signalled (XTN). These
signals control a simple sequencer (shown in Figure 4-32), which runs
at 8 MHz.
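
The behavior of the Transition and Zero Detector can be modelled in a few lines. This is a behavioral sketch, not the logic of Figure 4-32: bus samples are represented as +1, 0, or -1, "any two samples" is assumed to mean two consecutive samples, and the function name is hypothetical.

```python
def detect(samples):
    """Return a list of (XTN, BZRO) flags, one per pair of consecutive samples.

    `samples` holds bus levels (+1, 0, -1) taken at the 8 MHz sampling rate.
    XTN is true when the level changed between the two samples.
    BZRO is true when both samples were zero (bus assumed quiescent).
    """
    flags = []
    for previous, current in zip(samples, samples[1:]):
        xtn = previous != current
        bzro = previous == 0 and current == 0
        flags.append((xtn, bzro))
    return flags
```

In this model a bus that stays at zero produces BZRO on every sample pair, while a mid-bit Manchester transition from +1 to -1 produces a single XTN.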

The sequencer state is determined by six flip flops. The
first three specify one of six receive states (S0 - S5) which
determine the sequencer's view of what is occurring on the external bus as
indicated below:

S0 - HALT           Bus is quiescent
S1 - WAIT SYNCYC1   First microsecond of sync signal
                    (no transitions expected)
S2 - RESYNC         Second microsecond of sync signal
                    (transition expected in middle of period)
S3 - WAIT SYNCYC3   Third microsecond of sync signal
                    (no transitions expected)
S4 - RUN            Data bit being received
                    (transition expected in middle of period)
S5 - ERROR          Error detected; stop decoding until
                    bus becomes quiescent

Figure 4-32. Manchester to NRZ Translator

The other state flip flops form a time counter which is
synchronized with the incoming data. The incoming data is being
received at 1 MHz, while the time counter operates at 8 MHz, counting
from t0 to t7. When the sync signal arrives, the time counter is preset,
and it is periodically resynchronized during bus transitions so that
t0 - t7 define one bit time on the external bus.

A state diagram for the MNT is also shown in Figure 4-32.
If transitions occur on the external bus at the proper times for a
"correct" transmission, the states (S0 - S4) reflect whether sync or data
is being received. If an improper transition occurs, or an expected
transition fails to occur on the external bus, the sequencer enters the
error state (S5) and ceases operation until the current external bus
transmission completes and the bus returns to zero.

The Data Clock, Data Sync, and Command Sync are derived from
the sequencer and are generated during the states for which they are
valid (as shown in the state equations in Figure 4-32). A special flip
flop is included to "remember" if a transition has occurred on the
external bus during the current bus period (1 μsec) and is used to
detect unexpected (error) transitions or lack of expected transitions
on the external bus. XC1 indicates that a transition occurred. XC0 is
its complement. A special strobe pulse is generated to ensure that the
3:8 decoder is only enabled when its input signals are stable. Three
conditions asynchronously reset the MNT: external RESET, BZRO
(the bus returns to zero), and OUTM. When the BIBB is outputting to
the external bus (OUTM), the MNT is disabled since it is only used for
input.


4.3.4.1.2 The Buffer and Control Logic (BAC). The BAC logic consists
of three parts: (1) BAC Control, (2) BAC Data Paths, and (3) BAC Fault
Detection Logic. The BAC is unusual in that it uses the SCCM clock (φ2)
in the output mode (OUTM), and it uses an external-bus derived data
clock for internal control in the input mode (INM).

BAC Control

Figure 4-33 shows the BAC Control logic. These circuits
decode incoming commands (CCI) to the EBI, control the EBI input or
output mode (INM, OUTM), and provide a counter synchronized to incoming
or outgoing serial data (M1-M20). The following paragraphs describe
various component parts of this logic.


(a) Command Decoder - Decodes CCI commands or detects
improperly coded commands.

(b) State Control - Receipt of an OUTDATA or OUTCMD
command establishes the output mode (OUTM) and causes
the BAC to use the SCCM clock φ2. The OUTM mode is
terminated when the EBM is ready to output the next word
and no new words have been sent for output, i.e.,
OUTCMD or OUTDATA has not been received. The three
pairs of flip flops provide a means of recording the
next OUTCMD or OUTDATA command (fp2) while the current
such command is being executed. If no new commands
have been recorded by fp2 when it is time to send out
a new word (M1), fp3 is reset and OUTM is terminated.

(c) M Counter - The M Counter is synchronized with
incoming or outgoing data words. In the input mode
(INM) it is started at M1, when the first data bit
arrives from the external bus, and reaches count M17
when the final parity bit arrives on the incoming
word. During INM, the M counter is reset to M1 by an
incoming Command Sync, Data Sync or no activity on the
external bus. It is advanced by the incoming Data
Clock which generates seventeen pulses as the data and
parity bits arrive. An 18th pulse is generated (M18)
to allow follow-up logic functions such as transfer-
ring the newly arrived word from the XFR register to
the CDR register and alerting the BIBB Controller.

In the output mode (OUTM), the first three counts
(M1-M3) designate the time when a data sync or command
sync is output to the external bus. M4-M20 correspond
to transmission of data bits and parity of the out-
going word. The M counter is reset by the initial
OUTCMD or OUTDATA command which initiates the output
mode (NEWOUT * OUTM). (The M counter is a 20-count
counter.)


(d) Controller Alert Logic - This logic generates the
signals R1, R2 which alert the BIBB Controller to the
arrival of an input data word or the need for an
additional output data word. This logic is straight-
forward with the exception of the Simplex Synchronizer.

During the input mode, an available data word is
signalled by M18 and correct parity on the arrived
word (INM * PTY * M18). However, the A & B copies of
the EBI may be out of step by 125 nsec since they each
use their own bus-derived clock CL. The Simplex
Synchronizer waits until both copies agree that the
word has arrived, and then synchronizes the generation
of the R1, R2 signals with the SCCM clock φ2 (which is
the clock used by the BIBB Controller).

BAC Data Paths 

Figure 4-34 shows the BAC Data Paths. The Transfer Register
(XFR) provides serial-parallel conversion for incoming and outgoing data.
A serial parity checker is used to check incoming external bus words,
and to encode outgoing words. The Command-Data Register (CDR) serves as
a one-word buffer between the BIBB internal bus and the XFR register.
During the input mode, each incoming data word is automatically trans-
ferred to the CDR register immediately after it is assembled in XFR
(at M18), and the BIBB Controller is alerted (via R1, R2). The
controller has approximately 19 μsec to remove the word in CDR
before the next word arrives. The output driver logic allows contents
of the CDR or selected fields to be output to the BIBIB.

When the output mode is established (by OUTCMD or OUTDATA)
a command or data word is moved from the BIBIB to the XFR register.
Subsequent OUTCMD or OUTDATA commands move data from the BIBIB to the
CDR register. As each word is shipped out of the XFR (at M20) a new
word is taken from the CDR register. Transmission stops when no new
word is available. (It is important that the first output word not
disturb the CDR. At one point in the 1553A transmission sequence, a
status word is output before a 1553A command in the CDR is fully
processed.)


Field selection shown in Figure 4-34:

INWRD   BIBIB ← CDR (0-15)
INWC    BIBIB (11-15) ← CDR (11-15)
INSM    BIBIB (11-15) ← CDR (6-10)

NOTES: INM - input mode; OUTM - output mode (INM = NOT OUTM)
       INCYC = M1 ... M17 (counts incoming data bits after sync)
       OUTCYC = M4 ... M20 (counts outgoing data after sync)
       OUTCMD, OUTDATA - CCI commands to EBI to take word
           from BIBIB and output it over external bus
       INWRD, INWC, INSM - CCI commands to output whole word or
           fields to BIBIB
       CL - clock which counts serial data bits; in output mode it is the
           internal clock, for input mode it is a bus-derived data clock
       The output comparator is in copy B of the EBI only.

Figure 4-34. External Bus Interface, BAC - Data Paths

The Parity (aieck/Generate Circuit checks that laooming 
words fr<HB the BIBIB are properly coded and aaeodes outsoing words to 
the BIBIB. 

The Manchester Encoder is a combinational circuit which
generates a two-wire output to the 1553A bus drivers with the following
interpretation (d = don't care):

OUTPUT ENABLE     HILO     EXTERNAL BUS
      0             d            0
      1             1           +1
      1             0           -1

It generates a command sync (CMDS) or data sync (DATS) at the start
of each word and then Manchester-encodes the data bits which arrive
after the sync. (The timing intervals named in the original sentence
are illegible.)
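The two-wire interpretation above can be sketched in a few lines of Python. This is an illustration only, not the encoder logic of the report: the function names are hypothetical, and the bit convention (a one is sent high-then-low, a zero low-then-high) is assumed from MIL-STD-1553 practice rather than stated in this section. Sync generation is omitted.

```python
def bus_level(output_enable, hilo):
    """Map the encoder's two-wire output to the external bus level."""
    if not output_enable:
        return 0                 # driver disabled; HILO is a don't-care
    return 1 if hilo else -1     # HILO = 1 drives +1, HILO = 0 drives -1


def manchester_encode(bits):
    """Return (OUTPUT ENABLE, HILO) pairs, two half-bit cells per data bit.

    Assumed convention: 1 -> high then low, 0 -> low then high.
    """
    pairs = []
    for bit in bits:
        first_half = 1 if bit else 0
        pairs.append((1, first_half))
        pairs.append((1, 1 - first_half))
    return pairs


# Bus levels for the two data bits 1, 0
levels = [bus_level(oe, hilo) for oe, hilo in manchester_encode([1, 0])]
```

Keeping OUTPUT ENABLE separate from HILO lets the idle (zero-level) bus state be expressed without a third signal level on either wire.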


BAC Fault Detection Logic 

BAC Fault Detection Logic is shown in Figure 4-35. Each
copy of the BAC compares its outputs with the other copy and, after
careful strobing to assure that the signals are stable, sets a latch F1
(A≠B) if they disagree. Similar latches record parity errors detected
on the BIBIB (F2) and improperly coded commands (F3). In each copy of
the BAC, a master fault indicator (EBMIF) is generated and sent to the
BIBB Fault Handler.

Four of the fault indications (F1A, F2A, F3A, EBMIF(B)) can
be sampled for diagnostic purposes by (DUMPSTAT). This function is
activated by a Read Internal Status Command from the SCCM to the BIBB.
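The latch behavior described above might be modeled as follows. This is a hedged sketch: the class and method names are hypothetical, and forming EBMIF as the OR of the three latches is an assumption (the text only says that a master fault indicator is generated).

```python
from dataclasses import dataclass


@dataclass
class BacFaultLatches:
    f1: bool = False   # outputs of copy A and copy B disagree
    f2: bool = False   # parity error detected on the BIBIB
    f3: bool = False   # improperly coded command

    def strobe(self, out_a, out_b, bibib_parity_ok, command_ok):
        """Sample the checks once the signals are stable; latches only ever set here."""
        self.f1 = self.f1 or (out_a != out_b)
        self.f2 = self.f2 or (not bibib_parity_ok)
        self.f3 = self.f3 or (not command_ok)

    @property
    def ebmif(self):
        """Master fault indicator (assumed here to be the OR of the latches)."""
        return self.f1 or self.f2 or self.f3

    def reset(self):
        """Clear all fault latches."""
        self.f1 = self.f2 = self.f3 = False


latches = BacFaultLatches()
latches.strobe(out_a=0b1010, out_b=0b1010, bibib_parity_ok=True, command_ok=True)
clean = latches.ebmif
latches.strobe(out_a=0b1010, out_b=0b1000, bibib_parity_ok=True, command_ok=True)
```

Because the latches hold their state until reset, a transient disagreement remains visible to a later Read Internal Status command.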


4.3.4.2 The Internal Bus Interface (IBI). The IBI provides the
mechanism by which the BIBB can perform Direct Memory Access into the
memory of its host SCCM. Being connected to the SCCM's internal buses,
the IBI is also a convenient place for the decoding circuitry for
memory-mapped commands to the BIBB.

As shown in Figure 4-36, the IBI contains three 18-bit
registers to support DMA: an address register (ADROUT), and two data
registers for incoming and outgoing words (DRIN, DOUT). When the BIBB
sends data to the SCCM memory, it transfers an address via the BIBIB to
ADROUT and it transfers a data word to DOUT. The BIBB then activates
the DMA Controller with a DMA WRITE command. The Controller takes
control of the SCCM Internal Bus, and performs the specified memory write.
To read the SCCM's memory, an address is transferred from the
BIBIB to ADROUT and a DMA READ command is issued to the DMA Controller.
The controller returns a busy signal while the DMA transaction is in
progress.

Figure 4-35. External Bus Interface, BAC - Fault Detection Logic

(Logic diagram: comparison of copy A and copy B outputs, fault latches
F1A, F2A, F3A and EBMIF(B), and master fault indicators EBMIF(A),
EBMIF(B).)

F1A - disagreement
F2A - parity error on BIBIB
F3A - parity error on CCI
DUMPERR - output fault latches to BIBIB
RESFL - reset fault latches

The three address and data registers are independently
controlled by the BIBB Controller and the DMA Controller. A four-bit
Transfer Code (TC) is sent from the BIBB microprogrammed Controller and
decoded to control transfer of data into and out of the registers from
the BIBIB, as shown in Table 4-6.

An independent set of controls (D3SO, DIN, A3SO) is
generated by the DMA Controller to gate data words onto or off of the SCCM
local bus. Fault detection in the ADROUT, DRIN, and DOUT registers is
implemented using the error detection code (with two parity bits) which
is common to both the SCCM internal bus and the BIBB internal bus. In
order to detect the failure mode of a disabled load signal, the
registers can be periodically reset to zero (which is uncoded, i.e.,
not a valid code word) by the BIBB microprogram (CLEAR).
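A small sketch shows why a register cleared to all zeros is detectable. It assumes the two-parity-bit code described elsewhere in this report (two odd parity bits, one over the even bit positions and one over the odd bit positions of a 16-bit data word). The bit ordering and the placement of the parity bits at positions 16 and 17 are assumptions made for illustration.

```python
DATA_BITS = 16


def odd_parity_bit(bits):
    """Parity bit that makes the total number of ones (bits + parity) odd."""
    return 1 - (sum(bits) % 2)


def encode(word):
    """Return an 18-element bit list: 16 data bits followed by two parity bits."""
    data = [(word >> i) & 1 for i in range(DATA_BITS)]
    p_even = odd_parity_bit(data[0::2])   # covers even bit positions
    p_odd = odd_parity_bit(data[1::2])    # covers odd bit positions
    return data + [p_even, p_odd]


def is_valid(code):
    """Check both odd-parity groups of an 18-bit code word."""
    data, p_even, p_odd = code[:DATA_BITS], code[16], code[17]
    even_ok = (sum(data[0::2]) + p_even) % 2 == 1
    odd_ok = (sum(data[1::2]) + p_odd) % 2 == 1
    return even_ok and odd_ok


cleared_register = [0] * 18          # what CLEAR leaves in a register
good_word = encode(0x1234)
corrupted = list(good_word)
corrupted[3] ^= 1                    # single-bit error in an odd position
```

With odd parity, a properly coded zero data word carries two parity ones, so an all-zero register content fails both checks. If a load signal is stuck disabled after a CLEAR, the next use of the register is therefore flagged.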

The Direct Command Register is also included in the IBI.
One form of bus transfer (DC) causes eight bits from the BIBIB to be
loaded into the DC-Reg. Another command gates out this byte along with
a strobe level.


Table 4-6. IBI Transfer Commands

Code     Source           Destination

0001     DRIN             BIBIB
0010     DRIN             ADROUT and BIBIB
1011     BIBIB            ADROUT
0100     BIBIB            DOUT
1101     BIBIB (8-15)     DC REG
1110     STROBE           -
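Table 4-6 can be read as a simple decode table. The sketch below is illustrative only: the function names are hypothetical. It also observes that every code in the table contains an odd number of ones; treating that as a single-bit odd-parity check on TC is an assumption, although the report does state elsewhere that TC is parity encoded.

```python
# Transfer Code (TC) -> (source, destination), as listed in Table 4-6
TRANSFER_CODES = {
    0b0001: ("DRIN", "BIBIB"),
    0b0010: ("DRIN", "ADROUT and BIBIB"),
    0b1011: ("BIBIB", "ADROUT"),
    0b0100: ("BIBIB", "DOUT"),
    0b1101: ("BIBIB (8-15)", "DC REG"),
    0b1110: ("STROBE", None),
}


def has_odd_parity(code):
    """True if the 4-bit code contains an odd number of ones."""
    return bin(code & 0xF).count("1") % 2 == 1


def decode_tc(code):
    """Return (source, destination), or None for a code not in the table.

    Raises ValueError for a code that fails the (assumed) odd-parity check.
    """
    if not has_odd_parity(code):
        raise ValueError(f"TC {code:04b} is not a valid code word")
    return TRANSFER_CODES.get(code)
```

Any single-bit error in a transfer code produces an even-weight word, which this check rejects before a register is loaded.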








Two duplicated command decoders are employed to detect the
three memory-mapped commands to the BIBB. Either decoder can issue a
RESET or Read Internal Status (DUMPERR) command. Each Execute Bus
Table Command is sent to one of two duplicated control sequencer
circuits. If they disagree a massive disruption of control will occur
and be detected in the Controller. The Bus Assignment Latch stores the
number of the external bus being requested for a transmission. It is
parity checked and a fault latch is set when the parity signal is stable
(BSELF).

Figure 4-37 shows the DMA Control Logic. Its input command
codes (DMAC) are listed in Table 4-7.

The DMA Controller is an asynchronous circuit. Upon
receiving a (DMAC) command, the corresponding flip flop (READ, WRITE,
HOLD) is set. The SCCM internal bus is requested (R), and upon
receiving an acknowledgement (AK), the following occurs:

(a) For a READ command

    (1) The address is gated out (A3SO), NWRITE is
        raised, and a memory start (MSTART) is issued.

    (2) Upon receipt of a completion signal from memory
        (COMPL), data is gated into the DRIN register
        (DIN) and the READ flip flop is reset.

(b) For a WRITE command

    (1) The address is gated out (A3SO), DOUT is gated
        to the Data Bus (D3SO), WRITE is raised, and a
        MSTART is issued.


Table 4-7. DMA Command Codes (DMAC)

DMAC (0 - 2)     COMMAND

   1 0 0         NO DMA - Drop DMAHOLD
   0 0 1         DMA READ - DRIN ← M(ADROUT)
   0 1 0         DMA WRITE - M(ADROUT) ← DOUT
   1 1 1         HOLD - Hold SCCM Internal Bus


















    (2) Upon completion (COMPL) the WRITE flip flop is
        reset.

(c) The Hold state only requests (R) and holds the SCCM
    internal bus. Since it takes at least 3 clock periods
    to gain bus access, HOLD can be issued early to
    overlap setting up of ADROUT, DOUT, and the gaining of
    bus access.
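The READ, WRITE, and HOLD sequences above can be traced with a toy model. This is a sketch under simplifying assumptions: the real controller is asynchronous, whereas this model simply records the order in which the named signals are raised. The class and attribute names are hypothetical; a dictionary stands in for SCCM memory.

```python
class DmaController:
    """Toy, step-at-a-time model of the DMA Controller command handling."""

    def __init__(self, memory):
        self.memory = memory   # stands in for SCCM memory
        self.adrout = 0        # ADROUT address register
        self.dout = 0          # DOUT outgoing data register
        self.drin = 0          # DRIN incoming data register
        self.trace = []        # signals raised, in order

    def command(self, dmac):
        if dmac == "NO DMA":
            self.trace.append("drop DMAHOLD")
            return
        if dmac not in ("READ", "WRITE", "HOLD"):
            raise ValueError(f"unknown DMA command: {dmac}")
        self.trace += ["R", "AK"]          # request the bus, then acknowledge
        if dmac == "READ":
            self.trace += ["A3SO", "MSTART", "COMPL", "DIN"]
            self.drin = self.memory[self.adrout]     # DRIN <- M(ADROUT)
        elif dmac == "WRITE":
            self.trace += ["A3SO", "D3SO", "WRITE", "MSTART", "COMPL"]
            self.memory[self.adrout] = self.dout     # M(ADROUT) <- DOUT
        # HOLD only requests and holds the bus; no memory cycle is started


sccm_memory = {0x100: 0}
dma = DmaController(sccm_memory)
dma.adrout, dma.dout = 0x100, 0xBEEF
dma.command("HOLD")     # issued early to overlap register setup
dma.command("WRITE")
dma.command("READ")
```

Issuing HOLD first mirrors the overlap described in item (c): the bus request is already outstanding while ADROUT and DOUT are being set up.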

The check circuit contains two flip flops which are set by
READ and WRITE commands. They are reset only if they "see" that the
DMA cycle actually occurred (i.e., the appropriate command level (RD,
WT), a bus acknowledge (AK), and a completion signal COMPL). Two
conditions result in the fault indication DMAER:

(1) The check circuit "saw" a DMA command but none was
    performed. (The check flip flops do not get reset,
    resulting in the Z · BUSY fault condition.)

(2) A DMA was performed but the check circuits did not
    receive a command (MST · COMPL · Z).
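The two fault conditions can be illustrated with a small model of one check flip flop. This is a hedged sketch: the signal polarities in the report's expressions are not fully legible, so the model encodes only the behavior stated in prose (a command with no cycle, or a cycle with no command). All names are hypothetical.

```python
class DmaCheck:
    """One check flip flop (Z) plus the resulting DMAER indication."""

    def __init__(self):
        self.z = False       # set when a READ or WRITE command is seen
        self.dmaer = False   # fault indication

    def command_seen(self):
        self.z = True

    def cycle_completed(self, command_level, ak, compl):
        """Called when a memory cycle finishes on the SCCM bus."""
        if command_level and ak and compl:
            if not self.z:
                self.dmaer = True   # condition (2): DMA performed, no command seen
            self.z = False          # flip flop "saw" the cycle and resets

    def controller_idle(self):
        """Called when the DMA Controller reports it is no longer busy."""
        if self.z:
            self.dmaer = True       # condition (1): command seen, no DMA performed


normal = DmaCheck()
normal.command_seen()
normal.cycle_completed(True, True, True)
normal.controller_idle()

dropped = DmaCheck()
dropped.command_seen()
dropped.controller_idle()

spurious = DmaCheck()
spurious.cycle_completed(True, True, True)
```

The check is deliberately independent of the controller's own flip flops, so a fault in either path shows up as a mismatch between them.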

Figure 4-38 shows the fault-handling circuitry for the IBI.
There are four error checks. Both control inputs, (TC) and (DMAC), are
parity encoded, and they are checked with morphic parity checkers which
generate morphic signals PERR and DMACMP. These signals are synchronous
with the BIBB internal clock and can be combined and sent to the Fault
Handler. The other two fault signals, (BSELF) and DMAER, are not
synchronous with the BIBB clock and are latched locally within the IBI.

To generate a single "morphic" IBI fault indicator (IBIF),
we reduce the two incoming morphic fault signals (PERR, DMACMP) to a
single morphic pair and then logically OR the other two simplex fault
signals onto both lines of the morphic pair. This forces the
morphic pair (IBIF) to the error state (1,1) if one of the simplex
fault signals is activated.
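The combination just described can be sketched as follows. The reduction of two morphic pairs to one is shown here with the standard two-rail checker equations, which is an assumption: this section does not give the reduction circuit. A pair is taken as valid when its two lines are complementary, and the function names are hypothetical.

```python
def is_valid(pair):
    """A morphic (two-rail) pair is valid when its lines are complementary."""
    return pair[0] != pair[1]


def reduce_pairs(x, y):
    """Two-rail checker: the output is complementary only if both inputs are."""
    f = (x[0] & y[0]) | (x[1] & y[1])
    g = (x[0] & y[1]) | (x[1] & y[0])
    return (f, g)


def ibif(perr, dmacmp, bself, dmaer):
    """Combine two morphic signals and two simplex fault latches into IBIF."""
    f, g = reduce_pairs(perr, dmacmp)
    simplex = 1 if (bself or dmaer) else 0
    return (f | simplex, g | simplex)    # a simplex fault forces (1, 1)


no_fault = ibif((0, 1), (1, 0), bself=0, dmaer=0)
latched_fault = ibif((0, 1), (1, 0), bself=1, dmaer=0)
morphic_fault = ibif((1, 1), (1, 0), bself=0, dmaer=0)
```

ORing the simplex signals onto both lines means a latched fault cannot be masked by whatever value the morphic pair happens to carry.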


Figure 4-38. IBI Fault-Handling Circuits






Fault conditions are latched and can be read out with a
read internal status command. The (DUMPERR) signal, generated by that
command, causes the fault latches to be output and transferred to the
SCCM data bus (see Figure 4-36).

4.3.4.3 The Mill. The Mill provides limited processing capability
in the BIBB and is shown in Figure 4-39. Its two main components are a
memory and ALU. The Mill memory contains 48 eighteen-bit words of RAM,
and 16 eighteen-bit words of ROM. The parity encoding used to protect
the BIBIB (i.e., 2 odd parity bits over even and odd bit positions) is
also used to provide detection of Mill memory faults. A Mill memory
word can be output to the BIBIB from an address specified by either
(1) the BIBB microprogram, or (2) a local memory address register
(LMADR).

Also included in the Mill are a pair of sixteen-bit
A registers and ALUs. These circuits are duplicated and compared with
a morphic comparator (MCALU) to implement fault detection. Words on
the BIBIB can be stored in the A register and are also sent to a
port of the ALU. ALU outputs can be loaded back into memory or into
the LMADR register. Control codes and condition codes are shown in
Figure 4-39. It is possible to read, modify, and write a single Mill
memory word in a single clock cycle (e.g., increment a location).

Four fault checks are provided which are all morphic and
synchronous with the BIBB clock.

The address sent to Mill memory, and the BIBIB, are checked
for the (2 parity bit) internal bus code (MPC1, MPC2). The morphic
comparison of ALU outputs produces the morphic disagreement indicator
(MCALU). The control codes and memory address from the microprogram
(MILC) are encoded with a single parity bit. A morphic parity check is
performed producing (MILCK).



















The fault indicators are combined into a single morphic
fault detection pair (MILLCHK) which is sent to the BIBB Fault Handler.
Figure 4-39(b) shows that the individual morphic fault indicators are
decoded, latched, and made available for read out with a read internal
status command to the BIBB as previously described.

4.3.4.4 The Controller. The Controller consists of a Control
Sequencer (CS) and a Control ROM (CROM) which contains the microprograms
for the BIBB. Figure 4-40 shows the Control Sequencer and CROM. The
CS samples various conditions from the other logic circuits within the
BIBB. On the basis of these conditions it outputs a sequence of
addresses to the control ROM. The CROM, in turn, maps these addresses
into the control signals necessary to operate the BIBB.

Inputs to the CS are listed in Table 4-8 along with their
associated control information:

Table 4-8. Control Sequencer Inputs

Input        Associated Control Information

BIBIB     -  1553A commands - Terminal ID, S/M, and
             word count fields are available to CS
             along with T/R (transmit/receive bit)

BBUSY     -  From external logic - indicates that
             requested bus is not available

BZRO      -  From EBI - indicates that external 1553A
             bus is idle

R1, R2    -  From EBI - indicates incoming commands or
             data have arrived or a new word can be
             accepted for output (see Figure 4-31)

OUTMODE   -  From EBI - indicates EBI is in the output
             mode and is sending data over an external
             1553A bus

COND      -  Conditions from ALU - indicate that current
             arithmetic result is PLUS, MINUS, or ZERO

EX        -  Execute Bus Table Command received from
             the IBI

DMA BUSY  -  From IBI - DMA in progress

BXBM      -  From input pins - indicates "hard" name

Figure 4-40. The Controller - CROM and CS


The CROM generates a set of control levels (STATE, CSEL,
CNCC, TOCNO1, TOCNO2) which are used in the CS as will be described
below. Most of the other CROM outputs are the signals (previously
described) which control the MILL (MILC), IBI (TC, DMAC), and EBI (CCI).
Three additional signals are generated which require explanation. One
(RUPT) is a programmed interrupt to the SCCM. It is pseudomorphic in
that its complement is directly generated as shown. All CROM outputs
are encoded in the error detecting code shown in Figure 4-40, are
protected with two parity bits (P1,P2), and are checked with a morphic
parity checker. One odd control is included (DISAMILLE) - disable Mill
Fault Indicator. The Mill fault indicator (MILLCHK) is only valid when
there is properly coded data on BIBIB, which is most of the time. For a
few microinstructions, BIBIB is not coded, and the programmer commands
the Fault Handler to ignore MILLCHK during these instructions.

4.3.4.4.1 The Control Sequencer (CS). The CS is built around a PLA and
a Microprogram Location Counter (MLC) as shown in Figure 4-41. The MLC
generates a sequence of addresses to the CROM. It is reset to zero and
counts in the following fashion:

(1) If the PLA outputs a non-zero number which is not
    all ones (2^n - 1 for an n-bit PLA output), that number
    will be loaded into the MLC as a branch address
    (executed at the next clock period).

(2) If the PLA outputs zeros, the MLC will continue to
    the next sequential count (address).

(3) If the PLA outputs all ones, the current value
    of the MLC will be reloaded - holding it at its current
    value.
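The three counting rules can be summarized in one function. This is a minimal sketch: the function name is hypothetical and the 8-bit address width is an assumption (the exponent in the original text is not legible), so the width is left as a parameter.

```python
def next_mlc(mlc, pla_out, width=8):
    """Next Microprogram Location Counter value given the PLA output.

    width is the assumed number of address bits; it is not taken from the report.
    """
    all_ones = (1 << width) - 1
    if pla_out == 0:
        return (mlc + 1) & all_ones   # rule (2): next sequential address
    if pla_out == all_ones:
        return mlc                    # rule (3): reload current value (hold)
    return pla_out                    # rule (1): branch address
```

For example, with the default width, `next_mlc(5, 0)` steps to 6, `next_mlc(5, 0xFF)` holds at 5, and `next_mlc(5, 26)` branches to 26. Reserving the all-zeros and all-ones outputs for "continue" and "hold" means those two values cannot be used as branch targets.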







Figure 4-41. The Control Sequencer 













PLA operation is controlled by the 6-bit STATE input from
the microprogram. Each defined STATE input value (with the exception of
state zero, S0 = 000000) activates a set of AND terms in the PLA which
determine various branch addresses as a function of the PLA inputs. For
State S0, no AND-terms are decoded, so the microprogram proceeds
sequentially.

An example of the branching technique, taken from the Bus
Adaptor Microprogram, is shown below in Table 4-9:

Table 4-9. A Control Sequencing Example

CROM
Location   State   PLA AND Terms                         Control Outputs

   1        S1     NOT(R1·MINE) → HOLD                   BIBIB ← COMMAND
                   R1·MINE·T → 26
                   R1·MINE·(NOT T)·(OP = 0000d) → 18
                   R1·MINE·(NOT T)·(OP = 00010) → 18
                   R1·MINE·(NOT T)·(OP = 00011) → 7

R1·MINE and all other OP-codes fall through as
sequential code (d = don't care)


When the microprogram gets to location one, we wish to do a
five-way branch on the basis of a 1553A command received in the BIBB.
We display the command on the BIBIB, which includes a T bit, and a 5-bit
S/M field which is interpreted as a command OP-code. These six bits are
sent directly to the PLA, along with a condition signal R1·MINE which
indicates that a command has been received which was addressed to this
BIBB. The state S=1 activates the five AND-terms shown above.

If no command arrives (i.e., NOT(R1·MINE)), the PLA outputs all
ones and generates a one-instruction wait loop. When the command
arrives (R1·MINE), the PLA generates a transfer address corresponding to
the command being decoded.
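The branch taken at location one can be written out as a decision function. This sketch follows the AND terms of Table 4-9 as reconstructed from the text (the complement bars are inferred from the prose, since they are not legible in the table itself). The function name and the HOLD and SEQUENTIAL markers are hypothetical.

```python
HOLD = "HOLD"              # PLA outputs all ones: wait at this location
SEQUENTIAL = "SEQUENTIAL"  # PLA outputs zeros: fall through to the next location


def pla_state_s1(r1, mine, t, op):
    """Branch decision for state S1.

    r1, mine : condition signals (command received, addressed to this BIBB)
    t        : transmit/receive bit of the command
    op       : 5-bit S/M field interpreted as an OP-code
    """
    if not (r1 and mine):
        return HOLD                    # no command yet: one-instruction wait loop
    if t:
        return 26                      # transmit command
    if op in (0b00000, 0b00001):       # OP = 0000d (d = don't care)
        return 18
    if op == 0b00010:
        return 18
    if op == 0b00011:
        return 7
    return SEQUENTIAL                  # all other OP-codes
```

Because the hold case is just another PLA output value, waiting for a command costs no extra microinstruction.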

Inputs to the PLA are listed below:

(1) TOC1 - A time out counter to verify expeditious
    completion of DMA in the SCCM.

(2) MINE - When a command is placed on the BIBIB, MINE
    indicates that its module ID matches the hard name or
    soft name of the BIBB.

(3) T - BIBIB(5) is the Transmit/Receive bit of a 1553A
    command displayed on BIBIB.

(4) S/M Field - BIBIB(6-10), the S/M field of a 1553A
    command displayed on BIBIB. It is also designated OP
    (for op-code to microprograms).

(5) STATE - from CROM. Activates a group of PLA AND terms
    to define the branch(es) associated with a given
    microprogram location.

(6) CC - Condition signal - selected as one out of sixteen
    by Condition Select (CSEL) control from microprogram.
    Multiplexed condition signals are shown in Figure 4-41.
    Only one can be used at a time for a given branching
    instruction.

Other circuitry in the CS is explained below:

(1) Status Register (SR) - contains the 1553A status word
    to be output during transmissions. SR(0-4) contains
    the external bus name determined from the external
    pins. SR(5) = 1 indicates an internal fault has shut
    down the host SCCM, and is generated from the Output
    Disable levels from the Core-BB. One CS outputs data,
    and the other outputs the parity bits for fault
    detection.

(2) ID Compare - The terminal ID field of an incoming
    command (displayed in BIBIB) is compared with the hard
    name and soft name of the BIBB. If the hard name
    matches in a transmit command, or if either hard or
    soft names match in a receive command, the level MINE
    is raised. A soft name register can be loaded or
    cleared under microprogram control, from BIBIB
    (11-15). The terminal ID of zero (00000) is reserved
    for broadcast commands since all BIBBs with their soft
    name register set to 0 will recognize it. A latch is
    provided to "remember" that a soft name match occurred
    until the end of a transmission (SOFT). It can be
    reset under program control.

(3) A loop counter is provided which can be loaded from
    BIBIB, and decremented under microprogram control.
    Its underflow is signalled to the condition multiplexor
    (LZRO).

(4) TOC1 - This time out counter counts eight pulses, and
    its overflow is an input to the PLA. It can be reset
    under program control.

(5) TOC2 - This time-out counter is used to detect when an
    expected incoming or outgoing word did not arrive. It
    is reset by R1, R2 or both under microprogram control.
    TOCNO1 and TOCNO2 inhibit resetting of the counter by
    R1 and R2 respectively. The counter counts 26 clock
    times, which is longer than the time for a single word
    transmission. Thus if the expected words arrive, it
    will not overflow because it will be reset by the next
    arrival of a command or data sync (R1,R2) at time 20.
    If the expected command or data word does not arrive,
    the counter overflows, and delivers the signal TOC2 to
    the condition multiplexor.

(6) F1,F2 - These flip flops can be set, reset and tested
    under microprogram control.
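The ID Compare rule in item (2) can be sketched as a short function. This is an illustration, not the report's logic design: the function name is hypothetical, and a cleared soft name register is modeled as the value 0, which is how the broadcast convention is described in the text.

```python
def mine(terminal_id, transmit, hard_name, soft_name=0):
    """Return True when an incoming command is addressed to this BIBB.

    terminal_id : 5-bit terminal ID field of the command on BIBIB
    transmit    : True for a transmit command, False for a receive command
    hard_name   : name wired on the input pins
    soft_name   : contents of the soft name register (0 when cleared)
    """
    hard_match = terminal_id == hard_name
    soft_match = terminal_id == soft_name
    if transmit:
        return hard_match              # only the hard name may answer a transmit
    return hard_match or soft_match    # either name accepts a receive


broadcast_received = mine(0b00000, transmit=False, hard_name=7, soft_name=0)
broadcast_transmit = mine(0b00000, transmit=True, hard_name=7, soft_name=0)
```

Restricting transmit commands to the hard name means that several BIBBs sharing a soft name can all listen to a message, while only one of them will ever drive the bus in reply.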

4.3.4.5 The Fault Handler. The Fault Handler (FH) is shown in
Figure 4-42. It is responsible for collecting fault signals from the
BIBB and, if a fault occurs, signalling an internal fault IF. As shown
in the figure, the morphic fault indicators are combined to a single
morphic pair which is decoded by duplex exclusive-NOR circuits and
combined with the EBI fault indicators (EBMIF) to set a pair of
duplicated fault latches. These latches generate the IF signals.
These fault signals are fed to two pairs of clocked flip flops.


Figure 4-42. Fault Handler















The first pair provides the option to stop the clock to the BIBB
before a reset occurs. This is useful in breadboarding for fault
isolation. The second pair guarantees a full cycle reset pulse to return
the BIBB to an initial state.

The reset command from the SCCM generates RES(A) and RES(B)
from duplex command decoders in the IBI. These simulate an internal
fault which results in a reset.

Additional circuits are provided to stop, start, and single
step the clock to simplify breadboarding.

4.3.5 BIBB Microprograms

The following are preliminary register-transfer descriptions 
of microprograms which cause the BIBB to perform as a Bus Adaptor or Bus 
Controller. The mnemonics refer to signals and registers previously 
described in this text. The notation M(XXX) refers to a Mill memory 
register containing the variable or constant named XXX. These variables 
are listed below: 


(a) Bus Adaptor

    BASEADR   - Address in SCCM memory where the mapping
                table resides which maps command pointers
                (SM) to data addresses

    PTR       - A pointer used to read out or store data
                words in sequential locations in the SCCM's
                memory

    WC        - Word count, counts words transmitted and
                is taken from 1553A command field

    COMND     - Memory location used to store incoming
                command

    BUSERADD  - Address in SCCM where BA can store error
                message

    ERRMESS   - Error message word from the BA

(b) Bus Controller

    BCTADR    - Address of Bus Control Table in SCCM

    PNT       - Pointer used to access BCT words

    BCT1, BCT2, BCT3
              - First three words of BCT

    PTR       - Pointer to data words in SCCM memory

    WC        - Word count

    STAT1     - Status word returned in a controller-
                terminal 1553A transmission

    STAT2     - Second status word returned in a
                1553A terminal-terminal transmission

    STATLOC   - Location where Controller Status Word is
                stored in SCCM memory

    STATLOC+1, STATLOC+2, STATLOC+3
              - Sequential locations from STATLOC

    MDOWN, COMOK, COMERR, BNA, BACT
              - CSW status words stored in STATLOC which
                indicate the results of the transmission

The Bus Adaptor and Bus Controller Microprograms are shown 
in Tables 4-10 and 4-11. 
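Before reading the tables, it may help to see the Bus Adaptor's indirect addressing in ordinary code. The sketch below is an interpretation of the variable descriptions above (BASEADR, PTR, WC) and of the microprogram's "read indirect" comments, not a transcription of the microprogram. The function name is hypothetical, and SCCM memory is modeled as a dictionary.

```python
def receive_message(memory, baseadr, sm, words):
    """Store an incoming message using the Bus Adaptor's pointer table.

    memory  : dict standing in for SCCM memory
    baseadr : address of the mapping table (BASEADR)
    sm      : S/M field of the 1553A command, used as a table index
    words   : data words of the message (the word count is len(words))
    Returns the final pointer value.
    """
    ptr = memory[baseadr + sm]   # DMA read of the table entry: data address
    for word in words:           # one DMA write per received word
        memory[ptr] = word
        ptr += 1                 # pointer advances through sequential locations
    return ptr


sccm = {0x40 + 2: 0x200}         # table entry for S/M = 2 points at 0x200
end_ptr = receive_message(sccm, baseadr=0x40, sm=2, words=[10, 11, 12])
```

The table lookup is what lets the host SCCM relocate message buffers without any change to the commands sent over the 1553A bus.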




Table 4-10. Bus Adaptor Microprogram

LABEL     CROM LOC  STATE  PLA-AND TERMS                        CONTROL OUTPUTS

CMD       0         S0     -                                    A ← M(BASEADR)

C Load A Reg. with SCCM address of pointer table.
C Then wait for incoming command.

          1         S1     NOT(R1·MINE) → HOLD                  BIBIB ← CDR, CKSOFT
                           R1·MINE·T → XMIT                     C Transmit Command
                           R1·MINE·(NOT T)·(OP=0000d) → SP      C Special Command
                           R1·MINE·(NOT T)·(OP=00010) → SP      C Special Command
                           R1·MINE·(NOT T)·(OP=00011) →         C Continue Command
                             WAITNEXT

C Branch on incoming T/R and S/M bits to processing routine.

RID       2         S0     -                                    DMA HOLD, M(PTR) ← CDR(SM)+A
          3         S0     -                                    DMA READ, ADROUT ← M(PTR)
          4         S0     -                                    START TOC1, M(WC) ← CDR(WC)

C RID - Read Indirect Command - Move WC to RAM in MILL
C and start DMA cycle to get data address specified by S/M.

          5         S2     DMA BUSY → HOLD
                           TOC1·DMA BUSY → TIMEOUT
          6         S0     -                                    M(PTR) ← DRIN, NO DMA

C We now have absolute SCCM address for incoming data.
C Now we wait for the data, or a XMIT command if this
C is the first command of a terminal to terminal transmission.

WAITNEXT  7         S3     (NOT TOC2)·(NOT R1)·(NOT R2) → HOLD  BIBIB ← CDR
                           R1 → MAYBE                           C maybe T/T Transmission
                           TOC2·(NOT R1)·(NOT R2) → TIMEOUT




Table 4-10. Bus Adaptor Microprogram (Continuation 1)

LABEL     CROM LOC  STATE  PLA-AND TERMS                        CONTROL OUTPUTS

C (Location 7) If command, go to MAYBE. If data interrupt, continue.

          8         S4     → DATAIN+1                           C first data word in CDR
DATAIN    9         S5     (NOT R2)·(NOT TOC2) → HOLD           TOCNO1
                           TOC2·(NOT R2) → TIMEOUT              C wait for next data word
          10        S6     PLUS + SOFT → *+2                    M(WC) ← M(WC)-1

C If soft name or not end of message, skip status output.

          11        S0     -                                    BIBIB ← SR, OUTCMD
          12        S0     -                                    DMA WRITE, ADROUT ← M(PTR),
                                                                M(PTR) ← M(PTR)+1
          13        S0     -                                    DOUT ← CDR

C Write received word into SCCM's memory.

          14        S7     ZERO → CMD-1                         M(WC)+0

C If word count is now zero, end message, wait for next command,

          15        S8     → DATAIN                             C else wait for next data
MAYBE     16        S9     MINE → ERROR                         BIBIB ← CDR

C It is a terminal/terminal transmission if not mine.

          17        S10    (NOT R1)·(NOT TOC2) → HOLD           TOCNO2
                           R1 → DATAIN
                           TOC2·(NOT R1) → TIMEOUT

C When status arrives,
C then wait for first
C data word.

SP        18        S0     -                                    M(COMND) ← CDR, CLEAR SN
          19        S5     (NOT R2)·(NOT TOC2) → HOLD           TOCNO1
                           TOC2·(NOT R2) → TIMEOUT              C If no data, timeout.

C Wait above for data to arrive (R2)


Table 4-10. Bus Adaptor Microprogram (Continuation 2)

LABEL     CROM LOC  STATE  PLA-AND TERMS                        CONTROL OUTPUTS

          20        S0     -                                    C Wait for time Status Out
          21        S0     -                                    BIBIB ← SR, OUTCMD
          22        S11    OP=(00000) → DC
                           OP=(00001) → SETSN
          23        S12    → CMD-1                              M(PTR) ← CDR

C This is the direct address command (00010) which loads
C a value into the pointer register. Command is
C completed, return to zero.

DC        24        S12    → CMD-1                              DC ← CDR, STROBE

C Output Direct Command completed, return to zero.
C do not collect $200.

SETSN     25        S12    → CMD-1                              BIBIB ← CDR, LOADSN

C Set soft name completed

XMIT      26        S0     -                                    DMA HOLD, M(PTR) ← CDR(SM)+A

C Establish ptr to address of data.

          28        S0     -                                    ADROUT ← M(PTR), DMA READ
          29        S0     -                                    M(WC) ← CDR(WC), START TOC1
          30        S2     DMA BUSY → HOLD                      C Wait for DMA to complete
                           TOC1·DMA BUSY → TIMEOUT
          31        S0     -                                    M(PTR) ← DRIN, ADROUT ← DRIN

C Now we have the address of the data, next get the data.

GETFW     32        S0     -                                    DMA READ, M(PTR) ← M(PTR)+1
          33        S2     DMA BUSY → HOLD                      START TOC1
                           TOC1·DMA BUSY → TIMEOUT


Table 4-10. Bus Adaptor Microprogram (Continuation 3)

LABEL     CROM LOC  STATE  PLA-AND TERMS                        CONTROL OUTPUTS

          34        S0     -                                    CDR ← DRIN, OUTDATA, NO DMA

C First data word sent out.

          36        S7     ZERO → CMD-1                         M(WC) ← M(WC)-1, START TOC1

C Exit if only one word, else go into output loop.

LOOP                       R1·TOC1 → HOLD                       C Wait for EBI ready for
                           R1·TOC2 → TIMEOUT                    C next word.
          38        S0     -                                    DMA READ, ADROUT ← M(PTR),
                                                                M(PTR) ← M(PTR)+1
                           DMA BUSY → HOLD
                           TOC1·DMA BUSY → TIMEOUT
          40        S0     -                                    CDR ← DRIN, OUTDATA

C Send next word for transmission.

          41        S7     ZERO → CMD-1                         M(WC) ← M(WC)-1

C If word count = 0, transmission is complete,

          42        S14    → LOOP

C else wait to deliver next word.

TIMEOUT   43        S0     -                                    NOP
ERROR     44        S0     -                                    ADROUT ← M(BUSERADD)
          45        S0     -                                    DOUT ← M(ERRMESS)
          46        S0     -                                    DMA WRITE
          47        S15    DMA BUSY → HOLD
          48        S13    → CMD-1

C Optional - Write error flag in SCCM memory upon
C detecting a bus error.



Table 4-10. Bus Adaptor Microprogram (Continuation 4)

LABEL     CROM LOC  STATE  PLA-AND TERMS                        CONTROL OUTPUTS

TmZT      49        S16    OP=(00011) → *+2                     BIBIB ← CDR
          50        S17                                         BIBIB ← SR, OUTCMD

C If not continue, send status and go to XMIT.

CONTX     51        S0     -                                    BIBIB ← SR, OUTCMD, DMA HOLD
          52        S0     -                                    M(WC) ← CDR(WC)
          52        S18    → GETFW                              ADROUT ← M(PTR)



Table 4-11. Bus Controller Microprogram

LABEL     CROM LOC  STATE  PLA-AND TERMS                        CONTROL OUTPUTS

          0         S0     -                                    RESET F1
START     1         S1     NOT EX → HOLD                        C Wait for command from
                                                                C SCCM
          2         S2     BBUSY → ABEND1                       M(BCTADR) ← DRIN, ADROUT ← DRIN
          3         S3     NOT BZRO → ABEND2                    M(PNT) ← DRIN+1
          4         S0     -                                    DMA READ
GETBCT    5         S4     DMA BUSY → HOLD                      C Wait for first BCT word
          6         S0     -                                    M(BCT1) ← DRIN
          7         S0     -                                    DMA READ, ADROUT ← M(PNT),
                                                                M(PNT) ← M(PNT)+1
          8         S4     DMA BUSY → HOLD                      C Wait for 2d BCT word.
LADR      9         S26    MINUS → INDIRECT                     M(BCT2) ← DRIN
          10        S0     -                                    M(PTR) ← DRIN
          11        S0     -                                    DMA READ, ADROUT ← M(PNT),
                                                                M(PNT) ← M(PNT)+1
          12        S4     DMA BUSY → HOLD                      C Wait for 3d BCT word.
          13        S0     -                                    CDR ← DRIN, M(BCT3) ← DRIN

C First 3 words of Bus Control Table moved to Mill memory.
C M(PTR) contains address of data message in SCCM.
C 1553A command in CDR to allow getting word count.

          14        S0     -                                    M(WC) ← CDR(WC)

C Get the word count into M(WC).
C Next decode terminal-terminal or controller-terminal.

          15        S5     MINUS → TT                           M(BCT1) ← M(BCT1)


Table 4-11. Bus Controller Microprogram (Continuation 1)

LABEL     CROM LOC  STATE  PLA-AND TERMS                        CONTROL OUTPUTS

C Branch if terminal to terminal, else controller/
C terminal.

          16        S6     T → XMT                              BIBIB ← M(BCT3)

C Branch if terminal is to transmit to controller.

          17        S7     → REC

C Transmission from controller to terminal.

REC       18               -                                    DMA READ, ADROUT ← M(PTR),
                                                                M(PTR) ← M(PTR)+1
          19        S4     DMA BUSY → HOLD                      C Wait for DMA of 1st data
                                                                C word.
          20        S0     -                                    CDR ← M(BCT3), OUTCMD

C Output 1553A Receive Command.

          21        S0     -                                    CDR ← DRIN, OUTDATA

C Output first data word.

SYNCOUT   22        S8     R1 → 21                              C Wait until Data is going
                                                                C out.
          23        S9     ZERO → GETST                         M(WC) ← M(WC)-1

C If this is the last word, wait for status.

          24        S0     -                                    DMA READ, ADROUT ← M(PTR),
                                                                M(PTR) ← M(PTR)+1
          25        S4     DMA BUSY → HOLD                      C Wait for DMA.
          26        S10    → SYNCOUT                            CDR ← DRIN, OUTDATA
GETST     27        S11    (NOT R1)·(NOT TOC2) → HOLD           TOCNO2
                           (NOT R1)·TOC2 → ABEND3
          28        S12    → CTOUT                              M(STAT1) ← CDR

C ABEND3 - no status received.


Table 4-11. Bus Controller Microprogram (Continuation 2)

LABEL     CROM LOC  STATE  PLA-AND TERMS                        CONTROL OUTPUTS

          29        S0     -                                    CDR ← M(BCT3), OUTCMD

C Output 1553 Transmit Command.

          30        S13    → HOLD                               C Wait for last output
                                                                C cycle.

C Now we drop back into the input mode.

WAITSTAT  31        S11    (NOT R1)·(NOT TOC2) → HOLD           TOCNO2  C Wait for status
                           TOC2·(NOT R1) → ABEND3               C No status.
          32        S0     -                                    M(STAT1) ← CDR
                                                                C Save status 1.
NXTDAT    33        S14    (NOT R2)·(NOT TOC2) → HOLD           TOCNO1  C Wait for data.
                           TOC2·(NOT R2) → ABEND3               C Data missing
          34        S0     -                                    ADROUT ← M(PTR),
                                                                M(PTR) ← M(PTR)+1
          35        S0     -                                    DMA WRITE, DOUT ← CDR
          36        S15    ZERO → OUT                           M(WC) ← M(WC)-1
          37        S16    → NXTDAT

C Above, Input Data Word, if WC = 0, end,
C else wait for next word.

          38        S0     -                                    DMA READ, ADROUT ← M(PNT),
                                                                M(PNT) ← M(PNT)+1
          39        S4     DMA BUSY → HOLD                      C Wait for DMA of second
                                                                C command.
          40        S0     -                                    CDR ← M(BCT3), OUTCMD
          41        S0     -                                    CDR ← DRIN, OUTCMD

C Output Receive CMD followed by XMT command.

          42        S0     -                                    C Wait one period.



Table 4-11. Bus Controller Microprogram (Continuation 3)

LABEL     CROM LOC  STATE  PLA-AND TERMS                        CONTROL OUTPUTS

          43        S17    → WAITSTAT-1                         SET F1
OUT       44        S18    F1 → CTOUT
          45        S11    (NOT R1)·(NOT TOC2) → HOLD           TOCNO2  C Wait for second
                           TOC2·(NOT R1) → ABEND3               C status message.
          46        S0     -                                    M(STAT2) ← CDR
                                                                C Save Status 2
          47        S19    → TTOUT
TTOUT     48        S0     -                                    ADROUT ← M(STATLOC+3)
          49        S0     -                                    DOUT ← M(STAT2), DMA WRITE
          50        S4     DMA BUSY → HOLD

C Write second status word for t-to-t transmission.

CTOUT     51        S0     -                                    ADROUT ← M(STATLOC+2)
          52        S0     -                                    DOUT ← M(STAT1), DMA WRITE
          53        S4     DMA BUSY → HOLD

C Write first status word for t-to-t transmission.

          54        S20    R → *+3                              BIBIB ← M(STAT1)
          55        S21    R → *+2                              BIBIB ← M(STAT2)
          56        S22    → WRCSW                              DOUT ← M(MDOWN)
          57        S22    → WRCSW                              DOUT ← M(COMOK)

C If status indicates SCCM OK, skip MDOWN.

ABEND1    58        S22    → WRCSW                              DOUT ← M(BNA)
ABEND2    59        S22    → WRCSW                              DOUT ← M(BACT)
ABEND3    60        S0     -                                    DOUT ← COMERR


Table 4-11. Bus Controller Microprogram (Continuation 4)

LABEL     CROM LOC  STATE  PLA-AND TERMS                        CONTROL OUTPUTS

WRCSW     61        S0     -                                    ADROUT ← M(STATLOC),
                                                                DMA WRITE
          62        S4     DMA BUSY → HOLD
          63        S0     -                                    DOUT ← M(BCTADR)
          64        S0     -                                    ADROUT ← M(STATLOC+1),
                                                                DMA WRITE, OUTPUT RUPT
          65        S23    DMA BUSY → HOLD
DONEXT    66        S0     -                                    A ← M(ONE)
          67        S24    ZERO → START-1                       M(BCT1) AND A
          68        S0     -                                    ADROUT ← M(PNT),
                                                                DMA READ
          69        S25    DMA BUSY → HOLD
                           NOT DMA BUSY → START+1

C If first word of BCT is odd, do next table.

INDIRECT  70        S0     -                                    M(BCT2) ← M(BCT2)
          71        S0     -                                    ADROUT ← M(BCT2),
                                                                DMA READ
          72        S4     DMA BUSY → HOLD
          73        S27    → LADR+1                             M(BCT2) ← DRIN

C If 2d word of BCT is negative,
C get indirectly specified address.




BIBLIOGRAPHY

ANDE 67   J. E. Anderson and F. J. Macri, "Multiple redundancy
          applications in a computer," Proc. 1967 Ann. Symposium on
          Reliability, Washington, D.C., January 1967, 553-562.

AVIZ 71a  A. Avizienis, et al., "The STAR (Self-Testing-And-Repairing)
          computer: an investigation of the theory and practice of
          fault-tolerant computer design," IEEE Transactions on Computers,
          vol. C-20, no. 11, November 1971, 1312-1321.

AVIZ 71b  A. Avizienis, "Arithmetic error codes: Cost and effectiveness
          studies for application in digital system design," IEEE
          Transactions on Computers, vol. C-20, no. 11, November 1971,
          1322-1331.

AVIZ 72   A. Avizienis and D. A. Rennels, "Fault-tolerance experiments
          with the JPL STAR computer," Digest of COMPCON '72 (Sixth
          Annual IEEE Computer Society Int. Conf.), San Francisco, CA,
          1972, 321-324.

AVIZ 75a  A. Avizienis, "Architecture of fault-tolerant computing
          systems," Digest 1975 Int. Symposium on Fault-Tolerant
          Computing, Paris, France, June 1975, 3-16.

AVIZ 75b  A. Avizienis, "Fault-tolerance and fault-intolerance:
          complementary approaches to reliable computing," Proc. 1975
          Int. Conference on Reliable Software, Los Angeles, CA,
          April 1975, 458-464.

AVIZ 77a  A. Avizienis, "Fault-tolerance and longevity: goals for high-
          speed computers of the future," Proc. Symposium on High Speed
          Computer and Algorithm Organization, University of Illinois,
          Urbana-Champaign, IL, April 1977.

AVIZ 77b  A. Avizienis and L. Chen, "On the implementation of N-version
          programming for software fault-tolerance during program
          execution," Proceedings 1977 Int. Computer Software and
          Applications Conference, Chicago, IL, November 1977.

AVIZ 77c  A. Avizienis, "Fault-tolerant computing: progress, problems
          and prospects," Proc. IFIP Congress 1977, Toronto, Canada,
          405-420.

AVIZ 80   A. A. Avizienis, "Fault-tolerance: The survival attribute of
          digital systems," Proc. IEEE, vol. 66, no. 10, October 1978,
          pp 1109-1125.

BARL 65   R. E. Barlow and F. Proschan, Mathematical Theory of
          Reliability, Wiley and Sons, New York, 1965.


BAUM 73   I. O. Baum, "Standardization of avionics information systems,"
          System Development Corporation, Santa Monica, CA,
          TM-5159/000/(remainder illegible), performed for ARPA Institute
          for Defense Analysis, August 1973.

BEUS 69   H. J. Beuscher, et al., "Administration and maintenance plan
          of No. 2 ESS," The Bell System Technical Journal, vol. 48,
          October 1969, 2765-2815.

BORG 72   B. R. Borgerson, "A fail-softly system for time-sharing use,"
          Digest 1972 Int. Symposium on Fault-Tolerant Computing,
          June 1972, 89-93.

BOUR 69   W. G. Bouricius, W. C. Carter, and P. R. Schneider,
          "Reliability modeling techniques for self-repairing computer
          systems," Proc. 24th National Conference of ACM, 1969, 295-309.

BOUR 71   W. G. Bouricius, et al., "Reliability modeling for fault-
          tolerant computers," IEEE Transactions on Computers,
          vol. C-20, November 1971, 1306-1311.

BREU 76   M. A. Breuer and A. D. Friedman, Diagnosis and Reliable Design
          of Digital Systems, Computer Science Press, Inc.,
          Woodland Hills, CA, 1976.

BRIC 73   J. L. Bricker, "A unified method for analyzing mission
          reliability for fault-tolerant computer systems," IEEE
          Transactions on Reliability, vol. R-22, no. 2, June 1973, 72-77.

BUTL 74   T. T. Butler et al., "LAMP: application to switching-system
          development," The Bell System Technical Journal, vol. 53,
          no. 8, October 1974, 1535-1555.

CART 64   W. C. Carter, et al., "Design of serviceability features for
          the IBM System 360," IBM Journal of Research and Development,
          vol. 8, no. 2, April 1964, 115-125.

CART 72   W. C. Carter et al., "Computer error control by testable
          morphic boolean functions - a way of removing hardcore,"
          Dig. of the 1972 International Symposium on Fault-Tolerant
          Computing, Newton, MA, IEEE Computer Society, June 1972,
          154-159.

CART 74   W. C. Carter, "Theory and use of checking circuits," Computer
          Systems Reliability, Infotech Information Ltd., 1974
          (Maidenhead, England), 413-454.

CART 76   W. C. Carter and C. E. McCarthy, "Implementation of an
          experimental fault-tolerant memory system," IEEE Transactions on
          Computers, vol. C-25, no. 6, June 1976, 557-568.

CART 77   W. C. Carter, et al., "Cost effectiveness of self-checking
          computer design," in Dig. 1977 Int. Symp. Fault-Tolerant
          Computing (Los Angeles, CA), pp 117-123, June 1977.

CHAN 74   H. Y. Chang, G. W. Smith, Jr., and R. B. Walford, "LAMP:
system description*** The Bell Syetem Technical Journal , 
vol. 53, no. 8* October 1974* 1e3l-1449. 

CONN 72 R. B. Conn, N. A. Alexandridia and A. Aviaienis* **Daaign of a 
fault-tolerant modular fMMnputer with dynamic redundancy," 

AFIPS Conference Proc. vol. 41* Fall JCC 1972* 1057-1067. 

COOP 76 A. E. Cooper and W. T. Chow* *^velopment of on-board space 
computer systems*" IBM Journal of Research and Development , 
vol. 20, no. 1, January 1976* 5-19. 

CORB 72 F. J. Corbato, J. H. Saltser* and C. T. Clingen, "Multics: 

the first seven years," AFIPS Conference Proceedings, vol. 40* 
1974, 571-583. 

DOWN 64 R. W. Downing, J. S. Nowak, and L. S. Tuomenoksa, "No. 1 ESS 
maintenance plan," The Bell System Technical Journal , vol. 43, 
no. 5, part 1, Sept^nber 1964, 1961-2019. 

EJCC 33 Information processing systems - reliability and requlronents, 
Proc. of the Eastern Joint Computer Conference . Washington, 

D.C., December 1953. 

EVER 57 R. R. Everett, C. A. Zraket, and H. D. Benington, "SAGE - A 
data-processing system for air defense," Proc. Eastern Joint 
Computer Conference , Washington, D.C., December 1957, 148-155. 

GOLD 75 J. Goldberg, "New problems in fault-tolerant computing," 

Digest 1975 Int. Symposium on Fault-Tolerant Computing. Paris, 
France, June 1975, 29-34. 

HAME 72 K. J. Hamer-Hodges, "Fault resistance and recovery within 

System 250," Proceedings of I.C.C. Conference . Washington, D.C., 
October 1972. 

HECH 76 H. Hecht, "Fault-tolerant software for real-time applications," 
ACM Computing Surveys , vol. 8, no. 4, December 1976, 391-407. 

HOPK 75 A. L. Hopkins, Jr. and T. B. Smith III, "The architectural 

elements of a symmetric fault-tolerant multiprocessor," IEEE 
Transactions on Computers , vol. C-24, no. 5, May 1975, 

498-505. 

HOPK 78 A. L. Hopkins, Jr., et al., "FTMP - A highly reliable 

fault tolerant multiprocessor for aircraft," Proc. IEEE , 
vol, 66., no, 10, October 1978, pp. 1221-1239. 


5-3 


IBM 67 An appllcntloi^rlnntnd miltiprocMslng sy8ta>» IBM Syatane 
Jonmnl t vol. 6* no. 2, 1967* 

ICR8 75 ProcnedlMS of tha 1975 Int» Confaranca on Bnliable Softwarn. 
Loa AngalM* CA» April 1^75. 

IRE 53 Baaaion Ut Sympoalun - diagnoatlc programa and marginal 

chacking for largo ecalo digital eomputarot Convention Rocord 
of the IRE 1953 National Convention, part 7» Mew York» M.Y., 
March 1953, 48-71. 

KATS 78 D. Natauki, et al. , **Pluribua - An operational fault-tolerant 
tottltiproceaaor," Proc. IEEE, vol. 66, no. 10, October 1978, 
pp. 1146-1159. 

LESH 76 H. F. Leah, et al. , "Software techniquea for a dlatrlbuted 

real-time procesalng ayatem," in Proc. IEEE National Aeroanace 
and Electronlca Conf . (Dayton, OH), pp. 290-295, May 1976. 

LEVY 75 H. 0. Levy and R. B. Conn, "A almulatlon program for relia- 
bility prediction of fault tolerant aystems," Digeat 1975 
Int. Sympoalum on Fault-Tolerant Computing. Paria, France, 

June 1975, 104-109. 

LOMD 75 R. L. London, "A view of program verification," Proc. 1975 

Int. Conference on Reliable Software, Loa Angelea, CA, April 
1975, 534-545. 

MAIS 71 F. P. Malaon, "The NECRA: a aelf-repalrable computer for 

highly reliable proceaa," IEEE Tranaactlona on Computers , 
vol. C-20, no. 11, November 1971, 1382-1393. 

MATH 70 F. P. Mathur, and A. Avlzlenia, "Reliability analysis and 

architecture of a hybrid-redundant digital system; generalized 
triple modular redunciancy with self-repair," AFIPS Conference 
Proceedings, vol. 36, 1970, 375-383. 

MATH 75a F. P. Mathur, and P. T. de Sousa, "Reliability modeling and 
analysis of general modular redundant systems," IEEE Trans- 
actions on Reliability, vol. R-24, No. 5, December 1975, 
296-299. 

MATH 75b F. P. Mathur, and P. T. de Sousa, "Reliability models of 

systems," IEEE Transactions on reliability , vol. R-24, no. 2, 
June 1975, 108-112. 

MCPH 76 J, A. McPherson and C. R. Klme, "A two-level approach to 

modeling system dlagnosablllty," Proc. 197b Int. Symposium on 
Fault-Tolerant Computing, Pittsburgh, PA, June 1976, 33-38. 

MERA 76 C. Meraud and F. Browaeys, "Automatic rollback techniques of 
the COPRA computer," Proc. 1976 Int. Symposium on Fault- 
Tolerant Computing , Pittsburgh, PA, June 1976, 23-29. 


5-4 


MOOR 56 
MORA 75 
NBLS 75 
NGYW 76 

NGYW 77a 

NGYW 77b 

OBLO 62 

ORNS 75 
PARK 74 

PARK 76 

R/VND 75 

RAM/X 72 


B. F. Moora and C. B. Shannon, "Raliabla clrculta using lass 
rallabla ralays," Journal of tha FraiUU.ln Instltuta . vol. 262, 
noa. 9 and 10, Saptanbar, Octobar 1956, 1^1-20^ ami 281-297. 

P. B. Moranda, ''Production of softwara roliability during 
dabugging," Proc« 1975 Annual Roliability and Maintainability 
Synpoalum, January 1975, 327*332. 

B. C. Nalson, "Softwaro reliability," Diaaat 1975 Int. 

Syntpoaiun on Fault-Tolarant Coaiputinn . Par la. Franca, June 
1975, 24-28. 

Y. «W. Ng and A. Aviaianla, "A nodal for tranaiant and pema- 
nent fault reccvary in cloaad fault^tolarant aystoma," 

Proc. 1976 Int. Synpoalum on Fault-Tolerant Comnutina. 
Pittsburgh, PA, June 1976, 182-188. 

Y. -W. Ng and A. Avizlenis, "A reliability model for grace- 
fully degrading and repairable fault-tolerant systems," Proc. 
1 977 Int. Synpoalum on Fault-Tolerant Comoutine. Los Angeles, 
CA, June 1977, 22-28. 

Y. -W. Ng and A. Avizlenis, "ARIES - an automated reliability 
estimation system," Proc. 1977 Annual Reliability and 
Maintainability Synpoalum, Philadelphia, PA, .lanuary 1977, 
108-113. 

.1. Oblonsky, "A self -correcting computer," Digital Inform.ition 
Processors, W. Hoifman, ed.. Interscience Publishers, New York, 
1962, 533-542. 

S. M. Ornstein, et al., "Pluribus - a reliable multiprocessor, 
AFIPS Conference Proceedings ." vol. 44, 1975, 551-559. 

B. Parhaml and A. Avizlenis, "A study of fault-tolerance 
techniques for associative processors," AFIPS Cunference 
Proceedings , vol, 43, 1974, 643-652. 

K. P. Parker, "Compact testing: testing with compressed 

data," Proc. 1976 Int. Symposium on Fault-Tolerant Computing. 
Pittsburgh, PA, Jute 1976, 93-98. 

B. Kandell, "System structure for software fault tolerance," 
IEEE Transactions on Software Engineering, vol. SE-1, no. 2, 
June 1975, 220-232. 

C. V. Ramamoorthy and L. C. Chang, "System modeling and testing 
procedures for microdiagnostics," lEKE Transactions on 
Computers . vol. C-21, no. 11, November 1972, 1169-1183. 


RENN 7U 

RENM 73b 
RENN 76 

RENN 78a 

RENN 78b 

RENN BOa 
RENN 80b 

SHOO 73 

SHOR 68 
SIEW 77 
SKL^ 76 
STIF 76 


D. A. Rcnncla and A. Avlaianis* "IMSt A reliability nodal Ing 
ayaton for aalf^rapairlng conpatara,** Dinaat of tha Third 
Intarnatlonal Synpoalum on Pault»Tolarant Conoutlna * 

Palo Alto, CA, June 1973, 131-135. 

D. A. Rannala, "Fault datactlon and racovary in redundant 
conputar uaing atandby aparaa," Tacluiical Report UCLA-ENC-7355 . 
University of California, Loa Angelas, CA, June 1973. 

D. A. Rannala, at al., "Tha unified data systan: A distri- 

buted processing natuork for control and data handling on a 
spacecraft,” in Proc. IEEE National Aaroapaca and Electronics 
Conf . (Dayton, OH), pp. 283-289, Hay 1976. 

D. Rennals, Fault-Tolerant buildina-block computer study. 

JPL Publication 78-67, Jet Propulsion Laboratory, California 
Institute of Technology, Pasadena, CA, July 1978. 

D. Rennels, "Architectures for fault tolerant spacecraft 
computers," Proc. IEEE. Vol. 66, No. 10, October 1978, 
pp. 1255-1268. 

D. A. Rennels, "Distributed Fault-Tolerant Computer Systems," 
Computer , Vol. 13, No. 3, March 1980, pp 55-65. 

D. Rennels and M. Buchwalter, "Selective redundancy in a 
building-block distributed computing system," Dig. Government 
Microcircuit Applications Conference . (Houston, TX), to be 
published November 1980. 

M. L. Shooman, "Operational testing and software reliability 
estimation during program development," Proc. 1973 IEEE 
Symposium on Computer Software Reliability . New York City, 

1973, 51-56. 

R. A. Short, "The attainment of reliable digital systems 
through the use of redundancy - a survey," IEEE Computer 
Group News , vol. 2, no. 2, March 1968, 2-17. 

D. Siewlorek, M. Canepa and S. Clark, "C.vmp: The architecture 

of a fault-tolerant multiprocessor," Proc. 1977 Int. Symposium 
on Fault-Tolerant Computing . Los Angeles, CA, June 1977. 

J. R. Sklaroff, "Redundancy management technique for space 
shuttle computers," IBM Journal of Research and Development , 
vol. 20, no. 1, January 1976, 20-28. 

J. J. Stiffler, "Architectural design for near- 1001 fault 
coverage," Proc. 1976 IEEE International Symposium on Fault- 
Tolerant Computing , June 21-23, 1976, Pittsburgh, PA. 


5-6 


8ZYG 76 8. A. 88jrg«nda and B* H. fhon^Mt **Modaling and digital 

aisulation for daaign variCication and diagnoaia*** IBBB 
Tranaactiona on Conootara. vol. C-25t so* 12» Dacanonr I876, 
1242-1233. 

TAND 76 Tandan 16 ByetM l^acription. Tandan Conputara lne.« Cupartlno, 
CA, OcCobar li^6. 

TANG 69 D. T. Tang and B. T. Chi«i( **Goding for arror control,** IBM 
8yat«na Joomal . vol. 8, no. 1, 1969, 48-86. 

TAYL 73 P. 8. Taylor, **A reliability and conparatlvo analyaia of two 
standby systam configurations,** 1BB8 Tranaactiona on 
Reliability, vol. R-22, April 1973, pp. 13-15. 

TOYW 78 W. N. Toy, *'Fault-tolerant doalgn of local ESS proceasoro,*' 
Proc. IEEE, vol. 66, no. 10, October 1978, pp. 1126-1143. 

ULTR 74 Reconflgurable computer syatea study . Ultrasystcsis, Inc., 

Newport Beach, CA, perforaed for NASA Langley Research Center, 
1974. 

VONN 56 J. Von Neumann, "Probabilistic logics and the synthesis of 
reliable organlsns from unreliable components," Automata . 
Studies, C. E. Shannon and J. McCarthy, eds., Ann. of Nath. 
Studies No. 34, Princeton University Press, 1956, 43-98. 

WAKE 74 J. F. Wakerly and E. J. McCIuskey, "Design of low-cost general- 
purpose self -diagnosing computers," Information Processing 74 , 
North-Holland Publ. Co., Asisterdam, 1974, 108-111. 

WENS 72 J. H. Wensley, "SIFT - software implemented fault- tolerance," 
AFIPS Conf. Proc . vol. 41, part 1, 1972, 243-254. 

WENS 76 J. H. Wensley et al., "The design, analysis and verification 
of the SIFT fault- tolerant system." Proc. Second Int. 

Conference on Software Engineer ina. San Francisco, CA, 

October 1976, 458-469. 

WENS 78 J. H. Wensley, et al., "SIFT: Design and analysis of a fault- 

tolerant computer for aircraft control," Proc. IEEE , vol. 66, 
no. 10, October 1978, pp. 1240-1255. 

WYLE 67 Wyle, H. and G. Burnett, "Some relationships between failure 
detection probability and computer system reliability," AFIPS 
Conference Proceedings. Fall Joint Computer Conference, 1967, 
745-756. 


3-7 


Sttpplwwnt 


KILP 72 P. 8. Kilpatrick at al.» All aanieoiiductor diatributad aaro- 
apaca procaaaor/aan»ry at<^y» Volam 1 Avionica Procaasing 
(taquiraoMOta, HonayiMll, Inc., APAL TR«»72 . parfomad for APAC, 
Wrighc-Pattaraon Air Porca Baaa, Ohio* 1972. 

15S3A . Aircraft Intaraal Ttaa biviaicw Coaaand/kaaponaa thiitiplaa 
Data Baa . DoD Military Standard 1353A* April 1975* 

0. Govamaant Printing Of flea 1 975-603-767-/ U 72. 


{ 


5-8 


APPENDIX 


I/O BUILDING BLOCKS


Input-output requirements of host systems vary widely in 
voltage ranges, currents, and timing parameters. The approach best 
suited to building-block development is to provide a standard set of 
functions which serve a majority of general applications. The user is 
required to supply any additional special functions unique to his 
applications. 

To be consistent with the FTBBC computer modules, all building 
blocks must provide memory-mapped I/O. That is, each I/O building block 
must recognize its identification and the function being requested from 
an out-of-range address appearing on the host computer's address bus. 
Data for output or input is transferred over the data bus in response to 
a write or read to the specified I/O address. 
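
The address recognition described above can be sketched as follows. This is a minimal illustration only: the report does not give an address layout, so the I/O window base (IO_BASE), the bit fields for block identification and function select, and the names decode and IoRequest are all assumptions made for the example.

```python
from typing import NamedTuple, Optional

# Hypothetical layout, not taken from the report: addresses at or above
# IO_BASE lie outside the memory range and are treated as I/O addresses.
IO_BASE = 0xF000
BLOCK_ID_MASK = 0x00F0   # bits 7..4: which building block is addressed
BLOCK_ID_SHIFT = 4
FUNCTION_MASK = 0x000F   # bits 3..0: which function of that block


class IoRequest(NamedTuple):
    function: int
    is_write: bool


def decode(address: int, is_write: bool, my_block_id: int) -> Optional[IoRequest]:
    """Return the requested function if this block is addressed, else None."""
    if address < IO_BASE:
        return None  # ordinary memory reference; the block ignores it
    block_id = (address & BLOCK_ID_MASK) >> BLOCK_ID_SHIFT
    if block_id != my_block_id:
        return None  # a different I/O building block is being addressed
    return IoRequest(function=address & FUNCTION_MASK, is_write=is_write)
```

Under these assumed fields, a write to address 0xF023 seen by block 2 decodes as a write request for function 3, while the same address seen by any other block is ignored.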

A second set of requirements is related to fault-tolerance. 
The I/O building block must check incoming addresses and data for proper 
coding, and utilize duplication or coding checks to verify proper 
functioning of its internal logic. Either an error in incoming data or 
detection of an internal fault must signal an error indication to the 
CORE Building Block. This internal error indicator should be a morphic 
(one-out-of-two) coded signal to prevent a single point failure. 
Finally, the building block must encode incoming data for presentation 
on the host computer's bus. 
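
A minimal sketch of these checks is given below, under stated assumptions. The report does not specify the code used on the bus, so a single odd-parity bit over a 16-bit word is assumed here. The two-wire indicator is modeled with complementary values meaning "no error" and equal values meaning "error"; the alternation of the valid pair with a phase input is an illustrative choice, not something the report states. All function names are hypothetical.

```python
from typing import Tuple


def parity_ok(word: int, parity_bit: int) -> bool:
    """Check odd parity over a 16-bit word plus its parity bit (assumed code)."""
    ones = bin(word & 0xFFFF).count("1") + (parity_bit & 1)
    return ones % 2 == 1


def error_indicator(data_ok: bool, internal_ok: bool, phase: int) -> Tuple[int, int]:
    """Produce a two-wire (one-out-of-two) indicator.

    While all checks pass, the wires carry complementary values, (0, 1)
    or (1, 0), alternating with `phase` (an assumption of this sketch) so
    that a wire stuck at one value eventually yields an invalid pair.
    On any detected error both wires are forced to the same value.
    """
    if data_ok and internal_ok:
        bit = phase & 1
        return (bit, 1 - bit)
    return (1, 1)


def indicates_error(pair: Tuple[int, int]) -> bool:
    """The receiver treats any pair that is not one-out-of-two as an error."""
    return pair[0] == pair[1]
```

In this model the receiving logic needs only to test whether the two wires differ; either an explicit error or a wire that stops toggling shows up as an equal pair.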

Typical I/O Functions 

The following is a listing of I/O functions which should be 
supplied by building-block modules. One special feature is important in 
achieving synchronization in voting configurations, as well as in 
decoupling I/O timing from detailed instruction timing in the Terminal 
Module. This feature creates a granularity in I/O timing by 
synchronizing outputs and inputs with the Real Time Interrupt which 
drives the computer system. Specifically, a Real Time Interrupt (RTI) 
input is provided with each building block, and typically provides 
a pulse every few milliseconds. All output commands are held within the 
building block and are executed at precisely the next RTI. Similarly, 
inputs are sampled and held through an RTI period. 




When synchronous execution is used in the Terminal Module, this 
technique allows [illegible] changing the I/O timing of unmodified 
programs. It also prevents DMA activity from the intercommunications 
bus from changing I/O timing due to slight variations in processor 
speed caused by stolen memory cycles. Finally, a restricted interaction 
with the host system coupled with synchronous software operation is 
expected to simplify verification and validation. 
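
The RTI synchronization described above can be illustrated with a small behavioral model. This is a sketch under assumptions, not the report's design: the class name RtiSyncedPort, its methods, and the 16-bit port width are chosen for illustration.

```python
from typing import Optional


class RtiSyncedPort:
    """Toy model of one parallel output and one parallel input port."""

    def __init__(self) -> None:
        self._pending_out: Optional[int] = None  # written by software, not yet applied
        self.output_pins = 0                     # value currently driven externally
        self._held_in = 0                        # input value captured at the last RTI

    def write(self, word: int) -> None:
        """Software write: held inside the block until the next RTI."""
        self._pending_out = word & 0xFFFF

    def read(self) -> int:
        """Software read: returns the value sampled at the last RTI."""
        return self._held_in

    def rti(self, input_pins: int) -> None:
        """RTI pulse: apply any pending output and sample the inputs."""
        if self._pending_out is not None:
            self.output_pins = self._pending_out
            self._pending_out = None
        self._held_in = input_pins & 0xFFFF
```

In this model the instant at which software calls write or read within an RTI period has no effect on what the outside world sees, which is the decoupling of I/O timing from instruction timing that the text describes.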

(1) I/O Function #1 Parallel Data Out. Outputs a 16-bit 
data word taken from the host computer's data bus at the 
next RTI pulse. 

(2) I/O Function #2 Parallel Data Input. Samples and holds a 
16-bit data word at the next RTI pulse. A separate Read 
Command transfers the sampled data into the host computer. 

(3) I/O Function #3 Serial Data Out. Shifts out a 16-bit 
data word at the next RTI pulse. Provides word gate and 
shift clock signals. 

(4) I/O Function #4 Serial Data In. Accepts up to 16 bits 
of serial digital data from a data source. 

(5) I/O Function #5 Pulse and Bilevel Input. This function 
is used to sense the logical state of up to 8 lines, and 
to sense the occurrence of a pulse event within a 
software-determined measurement period. The pulse sensing logic 
is reset on RTI (or multiples of RTI) time centers while 
the level sense logic is allowed to change state on 
1 µsec intervals. 

(6) I/O Function #6 Pulse Counter. This function totals 
the number of pulse events over a predetermined time 
interval. 


(7) I/O Function #7 [name illegible] (Modulo N Counting). 
Used to generate pulse streams which are integral 
submultiples of a Master Clock. 

(8) I/O Function #8 Pulse Output. Generates pulses with 
delay and width program-specified and derived from a 
Master Clock. Pulses are generated periodically on RTI 
time centers and are typically of 10 µsec or 100 µsec 
duration. 

(9) I/O Function #9 Analog Multiplexor. Up to 16 lines of 
analog data can be collected in a two-part operation. 
First the desired analog line is selected and the data is 
quantized at the next RTI time. The resulting digital 
data is held in an output register until it is retrieved 
by software with a subsequent read operation. 

(10) I/O Function #10 High Rate DMA. This function is 
designed to minimize handling of high rate data. A 
starting address and word count are loaded into the 
building block along with an output or receive request. 
Data is transferred to or from the host computer memory 
under the control of a peripheral device. 
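
As an illustration of one function in the list, the modulo counting that Function #7 appears to describe (pulse streams that are integral submultiples of a Master Clock) can be sketched as a divide-by-N counter. The function name divide_clock and the representation of clock ticks as integer indices are assumptions of this example.

```python
from typing import List


def divide_clock(master_ticks: int, n: int) -> List[int]:
    """Return the master-clock tick indices at which an output pulse occurs.

    A counter is incremented on every master-clock tick and wraps at n.
    One output pulse is emitted on each wrap, so the output rate is the
    master rate divided by n.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    pulses = []
    count = 0
    for tick in range(master_ticks):
        count += 1
        if count == n:  # counter reaches its modulus
            pulses.append(tick)
            count = 0
    return pulses
```

For example, with n = 3 an output pulse is produced on every third master tick, so 1000 master ticks with n = 4 yield 250 output pulses.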

These functions were selected to provide a general I/O 
capability. The functions are made sufficiently powerful so that the 
burden of high rate timing can be removed from software. In general, the 
software only has to provide outputs with a resolution of a few 
milliseconds (determined by the RTI) and the hardware takes care of the 
finer details. 

In order to proceed with I/O building block design, a detailed 
analysis of NAVY systems and procedures is required. However, the 
following general comments can be made regarding building block 
implementation. 

Implementation Strategy 

The circuitry for each I/O function is not complex and the 
implementation of fault-detection is straightforward. Where information 
structure is preserved (such as data in and out) parity checking can be 
employed. Control functions can be duplicated with morphic comparison. 


The density of VLSI technology is sufficiently high that a 
number of I/O functions can be placed on a single chip. The specific 
function which is required can be activated by connecting pins. This 
technique can reduce the inventory of building blocks to two or three. 
Most of the functions described above can be implemented on a single 
chip. 

One additional requirement is for the redundant use of I/O 
elements. To achieve redundancy in Terminal Modules, two or more 
modules are cross-strapped, i.e., their inputs and outputs are hooked 
together. One module is powered and the others are used as unpowered 
standby spares. When cross-strapped I/O is used, it is important that 
short-protection be provided at all output connections. Otherwise a 
shorted I/O connection could inactivate all the spares. Typical 
techniques for protection are to isolate outputs with series diodes 
and inputs with series resistors. Thus, hybrid isolator packages will 
be required as an integral part of or as an adjunct to the building 
blocks. 

I/O Building Block definition is an area recommended for 
further study in the areas of (1) a detailed definition of NAVY func- 
tional requirements and (2) chip development. 


